arXiv: math.ST/0806.0729



High dimensional gaussian classification 



Robin Girard

LJK, Grenoble, France 

Abstract: High dimensional data analysis is known to be a challenging problem (see [111]). In this article, we give a theoretical analysis of high dimensional classification of Gaussian data which relies on a geometrical analysis of the error measure. It links a problem of classification with a problem of nonparametric regression. We give an algorithm designed for high dimensional data which appears straightforward in the light of our theoretical work, together with the thresholding estimation theory. We finally attempt to give a general treatment of the problem that can be extended to frameworks other than gaussian.
1. Introduction

Let $\mathcal{X}$ be a vector space, typically $\mathcal{X} = \mathbb{R}^p$, but $\mathcal{X}$ can also be an infinite dimensional Polish space (i.e. a separable complete metric space). In Section 8, $\mathcal{X}$ is a separable Banach space. In the binary classification problem, the aim is to recover the unknown class $y \in \{0, 1\}$ associated with an observation $x \in \mathcal{X}$. In other words, we seek a classification rule (also called a classifier), i.e. a measurable $g : \mathcal{X} \to \{0, 1\}$. This rule gives an incorrect classification for the observation $x$ if $g(x) \neq y$. The underlying probabilistic model, which makes a performance measure of $g$ possible, is set by distributions $P_k$ ($k = 0, 1$) on $\mathcal{X}$. For $k = 0, 1$, the distribution $P_k$ is the distribution of the data having label equal to $k$. In this framework, the weighted sum of the probabilities of misclassification is defined by

\[ C(\pi, g) = \pi P_1(g(X) \neq 1) + (1 - \pi) P_0(g(X) \neq 0). \tag{1} \]

In a Bayesian framework, the weight $\pi$ reflects the marginal distribution of the label $Y$. In our approach, we do not want this marginal distribution to set the importance of the different errors. In the many applications we have in mind, such as tumour detection from an MRI signal, the class that appears most frequently is not necessarily the one for which a classification error has the most important medical consequences. This is the reason why we search for a procedure $g$ that minimises $C(\pi, g)$ and not its Bayesian counterpart $P(g(X) \neq Y)$.

Here, we do not want to study the influence of the weight $\pi$ in the problem. The main reason is that our results, to be given later, are simpler to formulate and to understand when $\pi = 1/2$, and that the problem we are interested in is the one that arises from the high dimension of the space $\mathcal{X}$, not the one related to the use of $\pi$. Therefore, in the rest of the present paper we will make the assumption that $\pi = 1/2$. In the sequel, we will set $C(g) = C(1/2, g)$. This is a usual assumption (see for example Bickel and Levina [6]).

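In the case where $\pi = 1/2$ it is known that, if $P_0$ and $P_1$ are equivalent, then the rule that minimises $C(g)$ is given by

\[ g^*(x) = \mathbf{1}_V(x), \quad V = \{x \in \mathcal{X} : \mathcal{L}_{10}(x) > 0\}, \quad \text{where } \mathcal{L}_{10} = \log\frac{dP_1}{dP_0} \tag{2} \]

is the logarithm of the likelihood ratio between $P_1$ and $P_0$ (i.e. of the Radon-Nikodym derivative).

In real life problems, $\mathcal{L}_{10}$ is unknown, and the only thing we have is a substitute $\hat{\mathcal{L}}_{10}$ of it. It is then natural to plug it into (2) and to use the classifier

\[ \hat g(x) = \mathbf{1}_{\hat V}(x) \quad \text{with} \quad \hat V = \{x \in \mathcal{X} : \hat{\mathcal{L}}_{10}(x) > 0\}. \]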

The natural question that we will investigate in this article is the following: 

Problem 1. Is there a simple way to relate the excess risk $C(\hat g) - C(g^*)$ to a measure of the log-likelihood "perturbation" $\hat{\mathcal{L}}_{10} - \mathcal{L}_{10}$?

In other words, we seek an upper bound and a lower bound of $C(\hat g) - C(g^*)$ by a simple-to-study real valued function of $\hat{\mathcal{L}}_{10} - \mathcal{L}_{10}$. In this article we focus on the gaussian case, and unless the contrary is explicitly stated, $P_1$ and $P_0$ will be equivalent gaussian probabilities on $\mathcal{X}$. We investigate Problem 1 and the answer we obtain in the general case leads to the bound

\[ C(\hat g) - C(g^*) \le c(r)\, \|\hat{\mathcal{L}}_{10} - \mathcal{L}_{10}\|_{L_2(\gamma)}^{1/6} \]

while $\|\mathcal{L}_{10}\|_{L_2(\gamma)} \ge r > 0$ for a gaussian measure $\gamma$, where $c(r)$ is a constant depending only on $r$. In some particular cases (when $\hat{\mathcal{L}}_{10} - \mathcal{L}_{10}$ and $\mathcal{L}_{10}$ are affine) we are able to give an explicit constant $c(\mathcal{L}_{10})$ and an exponent higher than 1/6 (exponent 1).

If we suppose that $P_0$ and $P_1$ have equal covariance, then it is known that $\mathcal{L}_{10}$ is affine and it is natural to take an affine $\hat{\mathcal{L}}_{10}$. The corresponding procedure is usually called Linear Discriminant Analysis (LDA) (even if the underlying procedure is affine). If we suppose that $P_0$ and $P_1$ have different covariances, then $\mathcal{L}_{10}$ is quadratic and it is natural to take a quadratic $\hat{\mathcal{L}}_{10}$. The corresponding classification procedure will be called Quadratic Discriminant Analysis (QDA).

These procedures are also known as plug-in procedures: $\hat{\mathcal{L}}_{10}$ is plugged into (2) in order to obtain $\hat g$. Plug-in procedures have been studied in a different context (see for example [3] and the references therein), but our approach differs from those.

The interest of Problem 1 in the gaussian setting is understood by addressing the problem of finding a good substitute $\hat{\mathcal{L}}_{10}$ for $\mathcal{L}_{10}$. For example, in many applications, we are given a learning set consisting of $n$ random variables drawn independently from $P_1$ and $n'$ drawn from $P_0$. The problem of finding a good substitute $\hat{\mathcal{L}}_{10}$ of $\mathcal{L}_{10}$ then becomes an estimation problem whose error measure is given in the answer to Problem 1. Also, our answer to Problem 1 given below gives rise to a natural way to estimate $\mathcal{L}_{10}$ in high dimension, which is the answer to what we call Problem 2:

Problem 2. Given a learning set, construct $\hat{\mathcal{L}}_{10}$ in order to get a satisfactory classification procedure in high dimension: a procedure that can be justified theoretically and with numerical experiments.

Classical methods of classification break down when the dimensionality is extremely large. For example, Bickel and Levina [6] have studied the poor performance of Fisher discriminant analysis. The number of parameters to learn in order to build a classification rule seems to be responsible for this poor performance. In the sequel we shall give theoretical non-asymptotic results that emphasise it. To overcome the poor performance, Bickel and Levina [6] propose to use a rule which relies on feature independence. Fan and Fan [12] propose to select the interesting features with a multiple testing procedure. Bickel and Levina give a theoretical study of a particular LDA procedure (i.e. an LDA procedure based on a particular estimator $\hat{\mathcal{L}}_{10}$); they do not study the QDA procedure.

The selection of interesting features constitutes a reduction of the dimension of the space on which the classification rule acts. Feature selection is widely used in high dimensional classification, and the procedures used for the selection of interesting features are often motivated by theoretical results (see [12]). Unfortunately, these theoretical results are based on the following two postulates. On the one hand, features can be a priori divided into two parts, an interesting one and a non interesting one. On the other hand, selecting the interesting features is necessary and sufficient to get a good classification rule. If we accept that these postulates reflect nothing but a relatively clear intuition, we would like to give an analysis of the classification risk in order to justify a feature selection method based on multiple hypothesis testing.

Thresholding techniques are widely used in the non-parametric regression 
framework (see [9] for an introduction to the thresholding techniques), and as 
we shall see, the techniques can be used to give an answer to Problem 2. Also 
we believe that our answer to Problem 1 will shed light on the simple link that 
exists between the nonparametric regression and the classification problem. 

Functional data analysis is the study of data that live in an infinite dimensional functional space. Hence curve classification is one of the problems it deals with. Since [17], functional data analysis has undergone further developments, especially in the context of classification (see for example [5] and the references therein). In the gaussian setting, it is rather natural to expect results that are dimensionless and that can be applied to any abstract Polish space. Hence, our answer to Problem 1 will be given in terms of $L_2(\gamma)$ norms, with $\gamma$ a gaussian measure, and since the constant involved in our theoretical result does not depend on the dimension, the extension from $\mathcal{X} = \mathbb{R}^p$ to more abstract spaces is straightforward.

Let us introduce some notation. In the whole article, $\gamma_{C,\mu}$ is a gaussian measure on $\mathcal{X}$ with mean $\mu$ and covariance $C$, $\gamma_C$ is the zero mean gaussian measure with covariance $C$ and $\gamma_p$ is the gaussian measure on $\mathbb{R}^p$ with mean zero and covariance $Id_{\mathbb{R}^p}$; $\Phi(x)$ is the cumulative distribution function of a real gaussian random variable with mean zero and variance one. If $\gamma$ is a probability measure on $\mathbb{R}^p$, $\|\Pi_{x^\perp} e\|_{L_2(\gamma)}$ is the norm of the orthogonal projection in $L_2(\gamma)$ of the vector $e \in L_2(\gamma)$ on the hyperplane orthogonal to $x \in L_2(\gamma)$; if $F \in \mathbb{R}^p$, $\|F\|_{L_2(\gamma)}$ will be the norm of the linear application $x \in \mathbb{R}^p \mapsto \langle F, x\rangle_{\mathbb{R}^p}$. We shall use both the fact that if $F \in \mathbb{R}^p$ and $\gamma$ is a gaussian measure with mean zero and covariance $C$, then $\|F\|_{L_2(\gamma)} = \|C^{1/2}F\|_{\mathbb{R}^p}$, and that $\|F\|_{L_2(\gamma)}$ is a natural measure that can be extended to an infinite dimensional framework. The symmetric difference between two subsets $A$ and $B$ of $\mathcal{X}$ is denoted by $A \Delta B$; it is the set of all elements that are in $A \setminus B$ or in $B \setminus A$. If $A$ is a matrix of $\mathbb{R}^p$, $\|A\|_{HS}$ will be the Hilbert-Schmidt norm of the matrix $A$, $\mathrm{trace}(A)$ the trace of $A$, and $q_A(x)$ will be given by $\langle Ax, x\rangle_{\mathbb{R}^p}$ for all $x \in \mathbb{R}^p$.
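The identity $\|F\|_{L_2(\gamma)} = \|C^{1/2}F\|_{\mathbb{R}^p}$ can be checked numerically. The following Python sketch compares a Monte Carlo estimate of the left-hand side with the closed form on the right. It assumes a diagonal covariance for simplicity; the function names, the vector `F` and the variances are illustrative choices and do not come from the article.

```python
import math
import random

def l2_gamma_norm_mc(F, variances, n_samples=100_000, seed=0):
    """Monte Carlo estimate of ||F||_{L2(gamma)}, gamma = N(0, diag(variances))."""
    rng = random.Random(seed)
    sds = [math.sqrt(v) for v in variances]
    total = 0.0
    for _ in range(n_samples):
        # <F, x> for one draw x ~ gamma
        dot = sum(f * rng.gauss(0.0, s) for f, s in zip(F, sds))
        total += dot * dot
    return math.sqrt(total / n_samples)

def l2_gamma_norm_exact(F, variances):
    """Closed form ||C^{1/2} F|| for a diagonal covariance C."""
    return math.sqrt(sum(v * f * f for f, v in zip(F, variances)))

F = [1.0, -2.0, 0.5]            # illustrative vector
variances = [4.0, 1.0, 0.25]    # illustrative diagonal of C
exact = l2_gamma_norm_exact(F, variances)
approx = l2_gamma_norm_mc(F, variances)
```

The two values should agree up to Monte Carlo error, which shrinks like the inverse square root of the number of samples.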

This article is organized as follows. We give the main theoretical results - 
leading to a solution to Problem 1- for the LDA procedure in Section 2, and 
for the QDA procedure in Section 3. In section 4 we give our algorithm for 
high dimensional data classification and the theoretical result related to it. This 
leads to our contribution to Problem 2 in the light of our solution to Problem 
1. In Section 5 we apply this algorithm to curve classification. In Section 6 we 
introduce a geometric measure of error and derive its link with the excess risk. 
Section 7 is devoted to the proof of results given in Section 2 and Section 8, to 
the proof of results given in Section 3 and possible generalisations. 
2. Affine perturbation of affine rules
2.1. A solution to Problem 1

2.1.1. Main result 

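In this section, $\mathcal{X} = \mathbb{R}^p$, $C$ is a symmetric positive definite matrix, $P_1 = \gamma_{C,\mu_1}$ and $P_0 = \gamma_{C,\mu_0}$. Under these hypotheses $\mathcal{L}_{10}(x) = \mathcal{L}^A_{10}(x)$ is affine on $\mathbb{R}^p$:

\[ \mathcal{L}^A_{10}(x) = \langle F_{10}, x - s_{10}\rangle_{\mathbb{R}^p} \quad \text{where} \quad s_{10} = \frac{\mu_1 + \mu_0}{2}, \quad F_{10} = C^{-1} m_{10} \tag{3} \]

and $m_{10} = \mu_1 - \mu_0$. In this section, we restrict ourselves to an affine substitute $\hat{\mathcal{L}}^A_{10}(x)$; we denote by $\hat F_{10}$ and $\hat s_{10}$ the corresponding substitutes of $F_{10}$ and $s_{10}$. We then decide that $X$ comes from $P_1$ if it is in

\[ \hat V = \{x \in \mathbb{R}^p \ \text{s.t.}\ \hat{\mathcal{L}}^A_{10}(x) > 0\}. \tag{4} \]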
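To make the affine rule concrete, here is a minimal Python sketch of the log-likelihood ratio $\langle F_{10}, x - s_{10}\rangle$ with $F_{10} = C^{-1}(\mu_1 - \mu_0)$, together with the separation $r = \|C^{-1/2}(\mu_1 - \mu_0)\|$. It assumes a diagonal covariance, and the numerical values are hypothetical. The closed form $\Phi(-r/2)$ used for the optimal risk is the classical equal-covariance formula for $\pi = 1/2$, recalled here as a standard fact rather than quoted from this section.

```python
import math

def phi(t):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(t / math.sqrt(2.0)))

def lda_log_ratio(x, mu0, mu1, variances):
    """<F_10, x - s_10> with F_10 = C^{-1}(mu1 - mu0) and C = diag(variances)."""
    F = [(b - a) / v for a, b, v in zip(mu0, mu1, variances)]
    s = [(a + b) / 2.0 for a, b in zip(mu0, mu1)]
    return sum(f * (xi - si) for f, xi, si in zip(F, x, s))

def separation(mu0, mu1, variances):
    """r = ||C^{-1/2}(mu1 - mu0)||, which equals ||F_10||_{L2(gamma_C)}."""
    return math.sqrt(sum((b - a) ** 2 / v for a, b, v in zip(mu0, mu1, variances)))

# Hypothetical two-dimensional example.
mu0, mu1, variances = [0.0, 0.0], [2.0, 0.0], [1.0, 4.0]
r = separation(mu0, mu1, variances)
optimal_risk = phi(-r / 2.0)  # risk of the optimal rule when pi = 1/2
label_at_mu1 = int(lda_log_ratio(mu1, mu0, mu1, variances) > 0)
```

Evaluating the rule at $\mu_1$ itself returns label 1, as expected, since $\mu_1$ lies on the positive side of the separating hyperplane through $s_{10}$.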
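One can define the angle $\alpha$ in $L_2(\gamma_C)$ between $F_{10}$ and $\hat F_{10}$ by

\[ \alpha = \arctan\left( \frac{\|\Pi_{F_{10}^\perp} \hat F_{10}\|_{L_2(\gamma_C)}\, \|F_{10}\|_{L_2(\gamma_C)}}{\langle F_{10}, \hat F_{10}\rangle_{L_2(\gamma_C)}} \right). \tag{5} \]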
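The angle between two directions in $L_2(\gamma_C)$ can be computed with the inner product $\langle u, v\rangle_{L_2(\gamma_C)} = \langle C^{1/2}u, C^{1/2}v\rangle_{\mathbb{R}^p}$. The sketch below uses the arccosine of the normalised inner product, which describes the same geometric angle as an arctangent form whenever the angle lies in $[0, \pi/2)$. It assumes a diagonal covariance, and the function names and test vectors are hypothetical.

```python
import math

def inner_c(u, v, variances):
    """<u, v>_{L2(gamma_C)} = u^T C v for C = diag(variances)."""
    return sum(c * a * b for a, b, c in zip(u, v, variances))

def angle_l2(F, F_hat, variances):
    """Angle between F and F_hat for the inner product induced by C."""
    num = inner_c(F, F_hat, variances)
    den = math.sqrt(inner_c(F, F, variances) * inner_c(F_hat, F_hat, variances))
    # Clamp to [-1, 1] to guard against rounding before taking arccos.
    return math.acos(max(-1.0, min(1.0, num / den)))

alpha_identity = angle_l2([1.0, 0.0], [1.0, 1.0], [1.0, 1.0])
alpha_weighted = angle_l2([1.0, 0.0], [1.0, 1.0], [4.0, 1.0])
alpha_scaled = angle_l2([1.0, 2.0], [3.0, 6.0], [4.0, 1.0])
```

Note that the angle is unchanged when `F_hat` is multiplied by a positive constant, and that it depends on the covariance through the inner product.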

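This angle will play a very important role in the sequel. We obtained the following solution to Problem 1.

Theorem 2.1. Let $\hat F_{10}$ and $\hat s_{10}$ be two $\mathbb{R}^p$ vectors and $\hat{\mathcal{L}}^A_{10}(x)$ be defined by substituting $\hat F_{10}$ and $\hat s_{10}$ for $F_{10}$ and $s_{10}$ in (3). Let $P_1$ and $P_0$ be two gaussian measures on $\mathcal{X} = \mathbb{R}^p$ with the same covariance $C$ and with means $\mu_1$ and $\mu_0$ respectively.

If $\hat V$ is the $\mathbb{R}^p$ subset defined by (4), we have:

\[ C(\mathbf{1}_{\hat V}) - C(\mathbf{1}_V) \le \frac{\mathcal{E}}{\|F_{10}\|_{L_2(\gamma_C)}} \]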
where 



£ = ( ^fg"^^^""^ l(Ao,^io - ^loM + \\Fio Ao||l2(7c) ) • (6) 

\V1'II-^10||l2(7c) / 

//|(Fio,sio-sio)kp| < 3:l(Ao,^io>L2(7c)l anda < 7r/4 (a is defined by (5)), 
then 

C{ly) - C{lv) < e ^^rf • (7) 

ll-f^l0||L2(7c) 

The proof of this theorem is given in Section 7 at Sub-section 7.4. It is a consequence of Theorem 7.1 obtained by simple geometric methods emphasizing the fact that $P_1(X \in V \setminus \hat V)$ is the measure of an area between two hyperplanes obtained by a rotation of angle $\alpha$. The proof also uses the inequality

\[ C(\mathbf{1}_{\hat V}) - C(\mathbf{1}_V) \le \frac{1}{2}\left( P_1(X \in V \setminus \hat V) + P_0(X \in \hat V \setminus V) \right) = \mathcal{R}(\mathbf{1}_{\hat V}), \tag{8} \]

which defines $\mathcal{R}(\mathbf{1}_{\hat V})$. We call $\mathcal{R}(\mathbf{1}_{\hat V})$ the learning error; it is the probability of making a wrong classification with $\hat g(x) = \mathbf{1}_{\hat V}(x)$ and a good classification with the optimal rule $g^* = \mathbf{1}_V$. We will use and motivate this measure of error more deeply in Section 6. Let us now give comments on Theorem 2.1.

imsart-generic ver. 2007/12/10 file: article-f inall . tex date: July 10, 2008



2.1.2. General comments 



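If we note

\[ \delta = \hat F_{10} - F_{10} \quad \text{and} \quad d_0 = \langle \hat F_{10}, s_{10} - \hat s_{10}\rangle_{\mathbb{R}^p}, \tag{9} \]

we have

\[ \hat{\mathcal{L}}_{10}(x) = \mathcal{L}_{10}(x) + \langle \delta, x - s_{10}\rangle_{\mathbb{R}^p} + d_0. \]

Also, in the sequel we will talk about affine perturbation of the optimal rule. The preceding theorem results from the study of affine perturbations of affine rules.

The case where $d_0 = 0$ will be studied later but we can already note that in this case, Theorem 2.1 yields

\[ C(\mathbf{1}_{\hat V}) - C(\mathbf{1}_V) \le \frac{\|\hat{\mathcal{L}}_{10} - \mathcal{L}_{10}\|_{L_2(\gamma_{C,s_{10}})}}{\|\mathcal{L}_{10}\|_{L_2(\gamma_{C,s_{10}})}} \]

which is a nice answer to Problem 1. In the sequel (see Section 7, Theorem 7.1), we shall see that it is optimal whenever $\|\mathcal{L}_{10}\|_{L_2(\gamma_{C,s_{10}})}$ does not become too large.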

The quantity $r = \|F_{10}\|_{L_2(\gamma_C)}$ measures the theoretical separation of the data. Indeed, it is the $L_1$ distance between $P_1$ and $P_0$, defined by $d_1(P_1, P_0) = \int |dP_1 - dP_0|$, that measures this separation: it is known that $d_1(P_1, P_0) = (1 - 2C(\mathbf{1}_V))$, which implies

\[ d_1(P_1, P_0) = \Phi\left(\frac{r}{2}\right) - \Phi\left(-\frac{r}{2}\right). \]

Also, $d_1(P_1, P_0) \sim r$ when $r \to 0$, and then the data cannot be distinguished by any rule. The data tend to be perfectly separated when $d_1(P_1, P_0) \to 1$. In this case, $r \to \infty$ and

\[ d_1(P_1, P_0) \sim 1 - \frac{4\, e^{-r^2/8}}{r\sqrt{2\pi}}. \]
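The behaviour of the separation measure at both ends can be explored numerically. The sketch below takes the expression $\Phi(r/2) - \Phi(-r/2)$ at face value and rewrites it with the error function, since $\Phi(t) - \Phi(-t) = \mathrm{erf}(t/\sqrt{2})$. The tail approximation $4e^{-r^2/8}/(r\sqrt{2\pi})$ is the standard Gaussian tail estimate applied to $2\Phi(-r/2)$; it is an assumption of this sketch, as are the function names.

```python
import math

def d1(r):
    """Phi(r/2) - Phi(-r/2), written with the error function."""
    return math.erf(r / (2.0 * math.sqrt(2.0)))

def one_minus_d1(r):
    """1 - d1(r), computed with erfc to keep precision for large r."""
    return math.erfc(r / (2.0 * math.sqrt(2.0)))

def tail_approx(r):
    """Gaussian tail approximation of 1 - d1(r) for large r."""
    return 4.0 * math.exp(-r * r / 8.0) / (r * math.sqrt(2.0 * math.pi))

slope_near_zero = d1(0.01) / 0.01                 # linear regime as r -> 0
tail_ratio = one_minus_d1(8.0) / tail_approx(8.0)  # approaches 1 as r grows
```

For small $r$ the separation grows linearly in $r$ (with slope $1/\sqrt{2\pi}$ for this expression), while for large $r$ the ratio between the exact tail and its approximation tends to one.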

Also note that in the infinite dimensional setting two gaussian measures $P_0$ and $P_1$ are either orthogonal (there exists a Borel set $A$ such that $P_1(A) = P_0(\mathcal{X} \setminus A) = 1$) or equivalent (i.e. mutually absolutely continuous), and the latter case occurs if and only if $r$ is finite.

Although, if £ measures the estimation error, 



--— and e 32 (10) 

II^10||L2(7c) 

in the upper bounds (6) and (7), arc linked with the proximity of the measures 
Pq and Pi. When ||-F'io|li2(7c) ^® large, data are well separated and the terms in 
(10) measure the impact of this separation on the excess risk. We believe that 
when ||-Fio|li,,(-,^) tends to 0, yp^^n^ ^ — - is linked to the error measure TZ{lv) 
used in the proof (defined by (8)). Indeed, it is not correct to think that the 
classification problem is harder (in the sense of the excess risk) when data are 
not well separated: straightforward computation leads to 

yVcRP C{ly)-C{g*)<\di{Pi,Po). 

As we shall see in the sequel (see Theorem 6.1) TZ{lv) behaves almost like the 
excess risk if and only if di{Po, Pi) does not tend to 0. 

The learning set has to be used to elaborate estimators $\hat F_{10}$ and $\hat s_{10}$ of $F_{10}$ and $s_{10}$. The preceding theorem allows us to quantify what intuition clearly indicates: a good estimation of the parameters $F_{10}$ and $s_{10}$ (or more indirectly $\mu_1$, $\mu_0$ and $C$) leads to a good classification rule. These estimators must lead to a small excess risk and by the preceding theorem

\[ E_{P^{\otimes n}}\left[ C(\mathbf{1}_{\hat V}) - C(\mathbf{1}_V) \right] \le \frac{E_{P^{\otimes n}}[\mathcal{E}]}{\|F_{10}\|_{L_2(\gamma_C)}}, \tag{11} \]

where $P^{\otimes n}$ is the learning set distribution.

It seems that little is known about the theoretical behaviour of the LDA procedure (a plug-in procedure) with respect to the optimal rule (the Bayes rule). The result that is classically used (see for example Anderson and Bahadur [2]) to show the consistency of an LDA rule using estimators $\hat F_{10} = \hat C^{-1}\hat m_{10} = \hat C^{-1}(\hat\mu_1 - \hat\mu_0)$ and $\hat s_{10} = (\hat\mu_1 + \hat\mu_0)/2$ is that the probability to observe $X \sim \gamma_{C,\mu_0}$ (in that case $X$ comes from class 0) falling into $\hat V$ (and to assign it to class 1) is

\[ P\left( \langle \hat F_{10}, C^{1/2}\xi\rangle_{\mathbb{R}^p} > \langle \hat s_{10} - \mu_0, \hat F_{10}\rangle_{\mathbb{R}^p} \,\middle|\, \mathcal{A} \right) = 1 - \Phi\left( \frac{\langle \hat s_{10} - \mu_0, \hat F_{10}\rangle_{\mathbb{R}^p}}{\|\hat F_{10}\|_{L_2(\mathbb{R}^p, \gamma_C)}} \right), \tag{12} \]

where $\mathcal{A}$ is the $\sigma$-field generated by the learning set, and $\xi$ is a centered gaussian random vector of $\mathbb{R}^p$ with covariance $Id_{\mathbb{R}^p}$. Note that the proof of (12) follows from a straightforward calculation. We believe that a direct analysis of this error term misses the geometrical aspect of the problem. In addition, this error has to be compared with the lowest possible error $C(g^*)$. Note that for the LDA procedure in a high dimensional framework, an analysis of the worst case excess risk has been done with (12) by Bickel and Levina [6] for a particular choice of $\hat F_{10}$ and $\hat s_{10}$. Our theorem, because it is intrinsic to the classification procedure, is markedly different from the type of result that they obtain. In particular, it will allow us to establish a revealing link between dimensionality reduction and thresholding estimation.



2.1.3. The constant part of the perturbation 

The error due to the constant part of the perturbation (do in equation (9)), is 
measured by 



10 



^10||l2(7) 



sio — sio 



In order to give a first simple analysis of this term, we are going to suppose that $\hat F_{10}$ and $\hat s_{10}$ are independent. This independence can be obtained by keeping a part of the learning set for the estimation of $F_{10}$ and a part for the estimation of $s_{10}$. In that case, if $n'$ observations of the learning set were used to construct $\hat s_{10}$, and if $\hat s_{10} = (\hat\mu_1 + \hat\mu_0)/2$ ($\hat\mu_i$ is the empirical mean of the observations of group $i$), then a straightforward calculation leads to



E 



-KAo, 



10||L2(7) 



< 



Ultimately, the difficulty of the problem does not come from the constant part of the perturbation, but from the linear part.

The conditions under which the second inequality (7) of the theorem is given are easily satisfied. The second condition is that $\alpha < \pi/4$. It is not difficult to satisfy if $\hat F_{10}$ and $F_{10}$ are close enough to each other. The first one is verified if the second is and if we have:



10 



, sio — SlO 



10||L2(7C) 



< -^II^10||l2(7c)- 



If for example siq = (fix + /^o)/2 and the learning set is composed of n' observa- 
tions uniquely used for the estimation of sio, then, given the rest of the learning 
-, siQ — sio)rp ^ 7 J- and the preceding condition is satisfied with 



Fio 



set, 

■ Il-''10||I,2(TC) 

probability 



( ^I|J^10|U2(7C)'^' 



2.1.4. The linear part of the perturbation



As we shall explain in the proof of Theorem 2.1, the angle $\alpha$ defined by (5) measures quite well the error due to the linear part of the perturbation. Also, the upper bound given in the preceding theorem is not sharp everywhere. Indeed, if $\beta \in \mathbb{R}$ and $\hat F_{10} = \beta F_{10}$, the error $\mathcal{R}(\mathbf{1}_{\hat V})$ is null and the bound (6) can be arbitrarily large. We believe that the study of methods designed to estimate a direction (a parameter on the sphere $S^{p-1}$) in a high dimensional setting is required. We only want to give the link between the problem of estimating $F_{10}$ as a vector of $\mathbb{R}^p$ and the problem of estimating $F_{10}$ in order to get a small $C(\mathbf{1}_{\hat V})$. In addition, this invariance of the error under dilatation only exists in the direction $F_{10}$, which is unknown, and it seems to be quite tricky to make direct use of it.

Let us give a simple example to illustrate the interest of the link between 
estimation and learning. 

Example 2.1. Let $\sigma > 0$, suppose $X \sim \gamma_{\sigma^2 I_p, F_{10}}$, $C = I_p$ and that $s_{10}$ is known. In the estimation problem of $F_{10}$ for classification we wish to recover $F_{10}$ from the observation $X$ and the error is measured by

\[ \frac{\mathcal{E}}{\|F_{10}\|_{L_2(\gamma_C)}} = \frac{\|\hat F_{10} - F_{10}\|_{\mathbb{R}^p}}{\|F_{10}\|_{\mathbb{R}^p}}. \]

In Example 2.1 the problem is exactly the one we encounter in the regression framework, while estimating $F_{10}$ from $p$ noisy observations of $(F_{10}[i])_{i=1,\dots,p}$ with an error measured with an $\ell^2$ norm. Suppose now that we want to let $p$ grow to infinity. If the coefficients of $F_{10}$ decrease sufficiently fast, for example if $F_{10} \in \ell^q(\mathbb{R})$ with $q < 2$, then (see for example [()]), it is possible to obtain a good statistical estimation of $F_{10}$ by setting to zero the coefficients that are, in absolute value, under a threshold. This is a thresholding estimation and we shall use this type of procedure in Section 4. In the case where we observe $X$ from the distribution $\gamma_{C/n, m_{10}}$ (or equivalently $X^i$, $i = 0, 1$, from the distribution $\gamma_{2C/n, \mu_i}$) and if $C \neq I_p$ is known, the problem can be reduced to the preceding particular case thanks to the transformation $x \mapsto C^{-1/2}x$. When $C$ is unknown, the parallel with the estimation framework is more delicate because the error $\mathcal{E}$ depends on $C$.
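The effect of thresholding in the setting of Example 2.1 can be illustrated with a short Python sketch. A sparse vector is observed in white noise and every coordinate whose absolute value falls under a threshold is set to zero. The threshold $\sigma\sqrt{2\log p}$ used here is a common choice in the thresholding literature and is an assumption of this sketch, as are the dimension, the signal and the function names; it is not presented as the procedure of Section 4.

```python
import math
import random

def hard_threshold(y, t):
    """Set to zero every coordinate whose absolute value is at most t."""
    return [v if abs(v) > t else 0.0 for v in y]

def squared_error(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((u - v) ** 2 for u, v in zip(a, b))

rng = random.Random(1)
p, sigma = 1000, 1.0
F = [8.0] * 5 + [0.0] * (p - 5)                  # hypothetical sparse signal
X = [f + sigma * rng.gauss(0.0, 1.0) for f in F]  # noisy observation of F
t = sigma * math.sqrt(2.0 * math.log(p))          # assumed threshold choice
F_hat = hard_threshold(X, t)

err_raw = squared_error(X, F)      # error when X itself is used as estimator
err_thr = squared_error(F_hat, F)  # error after thresholding
kept = sum(1 for v in F_hat if v != 0.0)
```

Using the raw observation accumulates noise over all $p$ coordinates, whereas the thresholded estimator keeps only a handful of coordinates, which is the dimension reduction effect discussed in the text.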

Remark 2.1. Replacing coefficients by zero in the regression framework of Example 2.1 is equivalent to reducing the dimension of the space on which the chosen classification rule acts. Selecting the significant coefficients of $F_{10}$ is equivalent to finding the directions $e_i \in \mathbb{R}^p$ for which $|\langle C^{-1/2}(\mu_1 - \mu_0), e_i\rangle_{\mathbb{R}^p}|^2$ is large. This is almost equivalent to finding the direction in which a theoretical version of the ratio between inter-variance and intra-variance is big. This type of heuristic with empirical quantities has been used by Fisher, whose strategy is to maximize the Rayleigh quotient (see for example [14]). The point is that the use of empirical quantities in high dimension can be catastrophic (see next subsection).

2.2. Procedures to avoid in high dimension 

We are going to give two results that will lead to the following precepts in the problem of estimating $\mathcal{L}_{10}$. While giving a solution to Problem 2,

1. one should not try to estimate the full covariance matrix $C$ from the data,

2. one should restrict the possible values of $m_{10}$ to a (sufficiently small) subset of $\mathbb{R}^p$.

These precepts have been known for some time, but we give precise non-asymptotic results emphasising them. The first fact is a consequence of Proposition 2.1 below while the second one results from Proposition 2.2.

These two propositions arise from the use of a more geometric error measure, the learning error $\mathcal{R}$, which has already been defined by (8) and which shall be studied in more detail in Section 6. In fact it is an easy geometric exercise, for one who knows a little about gaussian measures, to obtain the following lower bound

7^(M>ge (13) 

(which is the last point of Theorem 7.1 in Section 7) where $\alpha$, the angle in $L_2(\gamma_C)$ between $F_{10}$ and $\hat F_{10}$, is defined by (5). On the other hand, Theorem 6.1 from Section 6 leads to

C(,)-C(,-)>min|^|^C-V^„.,o|k.e '"""^"'-- 7^(,)^^l 



for all measurable $g : \mathcal{X} \to \{0, 1\}$. Also, it suffices to get a lower bound on the learning error $\mathcal{R}(\mathbf{1}_{\hat V})$ by the use of (13) to get a (good) lower bound on the excess risk when $d_1(P_0, P_1)$ cannot be as close as desired to zero. This is what we shall do. For the case where the distributions $P_1$ and $P_0$ are almost indistinguishable ($d_1(P_1, P_0) \to 0$) we refer to the discussion in Section 6.



2.2.1. One should not try to identify the correlation structure 

Let us recall that if $C$ is a symmetric positive semi-definite matrix (not necessarily invertible), one can define its generalised inverse, also called the Moore-Penrose pseudo-inverse: $C^-$. This generalised inverse $C^-$ arises from the decomposition $\mathbb{R}^p = Ker(C) \oplus Ker(C)^\perp$. On $Ker(C)$, $C^-$ is null, and on $Ker(C)^\perp$, $C^-$ equals the inverse of $\tilde C = C|_{Ker(C)^\perp}$ (i.e. $\tilde C$ is the restriction of $C$ to $Ker(C)^\perp$).

Proposition 2.1. Suppose we are given $X_1, \dots, X_n$ drawn independently from a gaussian probability distribution $P$ with mean zero and covariance $C$ on $\mathbb{R}^p$. Let $\hat C$ be the empirical covariance and $\hat C^-$ its generalised inverse. If $\hat F_{10} = \hat C^- m_{10}$ and $\hat s_{10} = s_{10}$, the classification rule $\mathbf{1}_{\hat V}$ defined by (4) leads to

arccos 

Ep»,^[7^(%)] > - 



e 8 



2tt 

Before we prove this proposition, let us comment on it in a few words.

Comment. As a particular application of this proposition, we see that the Fisher rule performs badly when $p \gg n$, which was already given in [6], but in a different form (asymptotic and not in a direct comparison of the risk with the Bayes risk). Many alternatives to the estimation of the correlation structure can be used, based for example on approximation theory of covariance operators, together with a model selection procedure or a more sophisticated aggregation procedure. Much work has already been done in this direction, see for example [7] and the references therein. The approximation procedure has to be linked with a statistical hypothesis, as is the case when stationarity assumptions are made that lead to a Toeplitz covariance matrix $C$ (i.e. $C_{ij} = c(i - j)$ with $c : \mathbb{Z} \to \mathbb{R}$ a $p$-periodic sequence). These matrices are circular convolution operators and are diagonal in the discrete Fourier basis $(g^m)_{0 \le m < p}$.

This is roughly the type of harmonic analysis that is used in Bickel and Levina [6] and combined with an approximation in [21]. Under assumptions such as commutation (or quasi-commutation) of the covariance with a given family of projections, the covariance matrix can be sought in the set of operators given by a spectral density. This leads to a huge reduction of the number of parameters to estimate. Let us finally notice that the use of harmonic analysis of stationarity in curve classification becomes very interesting when one considers the larger class of group stationary processes (see [20]) or semi-group stationary processes (see [16]).
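The diagonalisation of such circular convolution matrices by the discrete Fourier basis can be verified directly on a small example. In the sketch below the basis vectors are taken to be $g^m_k = p^{-1/2} e^{2i\pi mk/p}$, which is the usual normalisation and an assumption here, since the explicit formula is not available in this section. The sequence `c` and the function names are hypothetical.

```python
import cmath

def circulant(c):
    """Matrix with entries C[i][j] = c[(i - j) mod p]."""
    p = len(c)
    return [[c[(i - j) % p] for j in range(p)] for i in range(p)]

def fourier_vector(m, p):
    """m-th discrete Fourier basis vector, normalised to unit length."""
    return [cmath.exp(2j * cmath.pi * m * k / p) / p ** 0.5 for k in range(p)]

def spectral_value(c, m):
    """Eigenvalue of the circulant matrix attached to the m-th Fourier vector."""
    p = len(c)
    return sum(c[k] * cmath.exp(-2j * cmath.pi * m * k / p) for k in range(p))

def max_residual(c):
    """Largest entry of C g - lambda g over all Fourier vectors g."""
    p = len(c)
    C = circulant(c)
    worst = 0.0
    for m in range(p):
        g = fourier_vector(m, p)
        lam = spectral_value(c, m)
        Cg = [sum(C[i][j] * g[j] for j in range(p)) for i in range(p)]
        worst = max(worst, max(abs(Cg[i] - lam * g[i]) for i in range(p)))
    return worst

c = [2.0, 0.5, 0.0, 0.5]   # symmetric 4-periodic sequence (hypothetical)
residual = max_residual(c)
spectrum = [spectral_value(c, m).real for m in range(len(c))]
```

Only the $p$ values of the sequence need to be estimated instead of the $p(p+1)/2$ entries of a general covariance matrix, which is the reduction of parameters mentioned in the text.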

Proof. The proof is based on ideas from Bickel and Levina [(i] used in their The- 
orem 1: if C is the identity their exist ■ ■ ■ ,£,p, p valued random variables 
forming an orthonormal basis of W, a random vector (Ai, . . . , A„) of M" whose 
property are the following. 

1. The A,; are independent of each other, independent of (Cj)i=i,...,p, and nXi 
follows a distribution with n — 1 degrees of freedom. 

2. For every i, is drawn in an independent and uniform fashion on the 
intersection of the unitary sphere of MP and the orthogonal to ^1, ... , <^,;_i. 

3. The empirical estimator C oi C satisfies: 

n 
i=l 

where ii x,y £ MP, x ^ y is the linear operator of M^ that associates to 
z gMP the vector (x, z)^py. 

When C does not necessarily equal /p, we get, 7c— almost-surely: 

n n 
^-1/2(5^-1/2 ^ ^^^^ ^ ^1/2(^-^1/2 = ^ ® 




Then, if we define = (C^^^^^iOj ^i)Mpj we liave tlie following equations 

A 
A. 

(14) 
(15) 



(C-i77iio,C'-mio)L,(^^) = (C"i/2mio,Ci/2C'-Ci/2C-i/2mio)KP ^Y.T 



I-^1oIIl2(7c) 



i=i ^ i=i 



For reasons of symmetry (the are drawn uniformly on the sphere), we have 
for all subsets /„ from { 1 , . . . , p} of size n : 



E 





= E 






[ ELi A J 


[el 





and we obtain 



n 
P' 



(16) 



From equations (14) and (15), the expectation of the angle a between i^io and 
Fio in ^2(70) (defined by 5) is 



Eflall =E 



> E 



arccos 



arccos 



1^1=1 Xi 



(definition of a) 



Eti A 



ELiA 

( Cauchy-Schwartz inequality and function arccos is decreasing) 

e:IiA 

ELi A 

( Jensen inequality and concavity of arccos on [0, 1]) 



> arccos E 



> arccos \ \ j — ] (from (16)). 



This and inequality (13) lead to the desired result. 



□ 



2.2.2. One should not use a simple linear estimate to get $F_{10}$.

Proposition 2.2. Suppose that $C$ is a positive definite matrix, and that we are given $X_1, \dots, X_n$ drawn independently from a gaussian probability distribution $P$ with mean $m_{10}$ and covariance $C$ on $\mathbb{R}^p$. Let $\hat m_{10}$ be the associated empirical mean. Let us take $\hat F_{10} = C^{-1}\hat m_{10}$ and $\hat s_{10} = s_{10}$. Then, the classification rule $\mathbf{1}_{\hat V}$ defined by (4) leads to



arccos 

Ep«„[7^(ll^)] > 
Before we give a proof, we comment on this result briefly.

Comment. Suppose there exists < r < i? such that R > H-Fiol 1^3(70) — ^' 
From the preceding proposition, uniformly on all the possible values of ^1 and 
/xo, the learning error and the excess risk can converge to zero only if ^ tends to 
0. Recall that if no a priori assumption is done on mio, rhiQ is the best estimator 
(according to the mean square error) of mig. Also, as in the estimation of a high 
dimensional vector problem (such as those described in ([9])), one should make a 
more restrictive hypothesis on toiq. We will suppose, in Section 5, that if (afc)fc>o 
are the coefficients of C~^/^mio in a well chosen basis, then X]fe>o — 
< g < 2. 



Proof. As in the preceding proposition, we will use inequality (13). Also it is 
sufficient to show the following 



E [lall > arccos 



1 



(\AI||^^io||l, 



(ic) 



r 



where a is defined by (5). Because the function arccos is decreasing and concave 
on [0, 1], it suffices to obtain 



E 



I^1o||l2(7c)II-P'io||l2(7c) 



< 



g (V^Hi^lo||L2(7c) + !)• 



(17) 



On the other hand, 

|(Fio,Fio)l2(7C 



E 



-F'io||l2(7c)II-^1o||l2(7c) 



< E 



< E 



l-^loH£2 (7c) 

lAoIlL 



^loi|L2(7c) 
-F'io||l2(7c) 



-, 1/2 






1 +E 



-E 



\{Fw,Fio - flo)L2(7c)l 
II^1o||l2(7c)II-^1o||l2(7c) 

^wIl2{ic) 



l-^10lli2(7c) 



where this last inequality results from Cauchy-Schwartz. Recall that 

C-1/2 



10 



where ^ is a standardised gaussian random vector of MP . Also, we easily obtain, 

1/2 

E 



^io)i2(. 



7c) 



I^10lli2(7c) 



1 



and 



^i"lli2(7c) 

^10|Il2(7c) 



\\^c^'^F,4l, 

|V^CV2Fio+^||i 



The rest of the proof follows from the following simple fact which is a consequence of the Cochran Theorem and a classical calculation on random variables:

Let a > P € W , X & gaussian random vector of with mean /? and 
covariance Ip. Then 



E 



1 

m 



< 



1 



□ 



2.3. Case where $\|F_{10}\|_{L_2(\gamma_C)}$ diverges: well separated data.

We shall now briefly consider the case when the data are well separated: the case where $\|F_{10}\|_{L_2(\gamma_C)}$ diverges. In the next theorem, we assume that $p$ tends to infinity.

Theorem 2.2. Suppose that < a < tt/2 (a is defined by (5)), and that 
cos(Q:)j|_F'io||L2(^p) oo when p tends to infinity. We then have 

r St liminfp^oo TTir^ T < 1 

This theorem is proved in Section 7. In the case of well separated data it 
is obvious that the optimal rule will perform perfectly. Theorem 2.2 shows 
that for a given estimator Fio one should check that the probability to have 
hminfp^oo ttt^ — §^ ; > 1 is small enough. 

3. Quadratic perturbation of quadratic rule 

3.1. Main results and remarks about the infinite dimensional setting 

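In the case where $C_1 \neq C_0$, $\mathcal{L}_{10}(x) = \mathcal{L}^Q_{10}(x)$ is a polynomial function of degree two on $\mathbb{R}^p$:

\[ \mathcal{L}^Q_{10}(x) = -\frac{1}{2}\langle A_{10}(x - s_{10}), x - s_{10}\rangle_{\mathbb{R}^p} + \langle G_{10}, x - s_{10}\rangle_{\mathbb{R}^p} - c, \tag{18} \]

where

\[ A_{10} = C_1^{-1} - C_0^{-1}, \quad G_{10} = S m_{10}, \quad S = \frac{C_0^{-1} + C_1^{-1}}{2}, \quad c = \frac{1}{8}\langle A_{10} m_{10}, m_{10}\rangle_{\mathbb{R}^p} - \frac{1}{2}\log\left|\det(C_1^{-1} C_0)\right|, \tag{19} \]

$m_{10}$ and $s_{10}$ are defined by (3).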
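As a sanity check of the quadratic form of the log-likelihood ratio, the following Python sketch evaluates it in dimension one, with $A_{10} = 1/v_1 - 1/v_0$, $G_{10} = \frac{1}{2}(1/v_0 + 1/v_1)\,m_{10}$ and $c = \frac{1}{8}A_{10}m_{10}^2 - \frac{1}{2}\log(v_0/v_1)$, and compares it with the difference of the two Gaussian log-densities. The one-dimensional restriction, the parameter values and the function names are assumptions made for illustration.

```python
import math

def log_gauss(x, mu, v):
    """Log-density of a real gaussian with mean mu and variance v."""
    return -0.5 * math.log(2.0 * math.pi * v) - (x - mu) ** 2 / (2.0 * v)

def quad_log_ratio(x, mu0, mu1, v0, v1):
    """One-dimensional quadratic log-likelihood ratio log(dP_1/dP_0)(x)."""
    s = (mu1 + mu0) / 2.0
    m = mu1 - mu0
    A = 1.0 / v1 - 1.0 / v0
    G = 0.5 * (1.0 / v0 + 1.0 / v1) * m
    c = A * m * m / 8.0 - 0.5 * math.log(v0 / v1)
    y = x - s
    return -0.5 * A * y * y + G * y - c

mu0, mu1, v0, v1 = 0.0, 2.0, 1.0, 4.0   # hypothetical parameters
points = [-3.0, -0.5, 0.0, 1.0, 2.5, 6.0]
gaps = [abs(quad_log_ratio(x, mu0, mu1, v0, v1)
            - (log_gauss(x, mu1, v1) - log_gauss(x, mu0, v0)))
        for x in points]
max_gap = max(gaps)
```

The two expressions agree up to rounding at every test point, so the sign of the quadratic form gives the same decision as a direct comparison of the two densities.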

Remark 3.1. The equation (19) giving $\mathcal{L}^Q_{10}(x)$ can be modified using the fact that

\[ A_{10} = \frac{1}{2}\left( C_1^{-1/2} W_{10} C_1^{-1/2} - C_0^{-1/2} W_{01} C_0^{-1/2} \right) \quad \text{where} \quad W_{ij} = I - C_i^{1/2} C_j^{-1} C_i^{1/2}. \tag{20} \]

This modification has two advantages. It involves $W_{ij}$, which play an important role in the infinite dimensional framework (see Remark 3.2). On the other hand, it involves $W_{10}$ as much as $W_{01}$, which can lead in practice (while estimating $A_{10}$) to a symmetric procedure that does not give more importance to any group.

In the classification problem, a polynomial of degree two $\hat{\mathcal{L}}^Q_{10}(x)$ is used as a substitute for $\mathcal{L}_{10}$. We decide that $X$ comes from class one if it belongs to

\[ \hat V = \{x \in \mathbb{R}^p \ \text{s.t.}\ \hat{\mathcal{L}}^Q_{10}(x) > 0\}. \tag{21} \]

The following theorem gives our solution to Problem 1.

Theorem 3.1. Let $\gamma$ be a gaussian measure on $\mathbb{R}^p$. Suppose that $\hat{\mathcal{L}}^Q_{10}$ is a polynomial of degree two on $\mathbb{R}^p$ and that we have $\|\mathcal{L}_{10}\|_{L_2(\gamma)} \ge r$ for $r > 0$. Then, for all $q \in\, ]0, 1[$, there exists $c_1(r, q) > 0$ such that

n{ly)<c^{r,q)\\C?,^C%ri\^y (22) 

where $\hat V$ is given by (21) and $\mathcal{R}$ by (8).

We emphasise the fact that $c_1(r, q)$ depends only on $r$ and $q$. In particular it does not depend on the dimension $p$ of the problem. The proof of this theorem is given in Section 8. It is implicitly infinite dimensional, and the preceding theorem could have been stated in an infinite dimensional framework. We do not want to introduce this complicated framework and we refer to [8] for an introduction to the subject. The infinite dimensional framework highlights a particular aspect of the problem that is contained in the following remark.

Remark 3.2. [infinite dimensional framework] When $X$ is a separable Hilbert
space (it can also be a separable Banach space in the case of LDA), two gaussian
measures $\gamma_{C_1,\mu_1}$ and $\gamma_{C_0,\mu_0}$ that are not equivalent are orthogonal.

If these measures are orthogonal, then the observed data from the two classes
are perfectly separated and $\mathcal{C}(g^*) = 0$. In this case one can hope to obtain $\mathcal{C}(g) = 0$
for a reasonable classification rule $g$ (even if it is not trivial, see Theorem 2.2
in the linear case).

A necessary and sufficient condition for these measures to be equivalent is
that

\[ m_{10} = \mu_1 - \mu_0 \in H(\gamma_{C_1,\mu_1}) = H(\gamma_{C_0,\mu_0}), \tag{23} \]

and

\[ W_{10} = I - C_1^{1/2} C_0^{-1} C_1^{1/2} \in HS(X), \tag{24} \]

where $H(\gamma)$ is the reproducing kernel Hilbert space associated with a gaussian
measure $\gamma$ and $HS(X)$ is the space of Hilbert-Schmidt operators with values in $X$
(see corollaries p. 293 in [8]). In particular, the eigenvalues of $W_{10}$ are in $\ell^2$. In
the case where the measures are equivalent, one can define $\mathcal{L}_{10}$ as a limit (almost surely
and in $L_2$) of its finite dimensional counterpart. This limit can also be understood as a
measurable and square integrable (with respect to $\gamma_{C_1,\mu_1}$) polynomial of degree
two in $X$ (see Chapter 5.10 in [8]).
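3.2. Comment and Corollary

Suppose $\hat{\mathcal{L}}^{Q}_{10}(x)$ is defined by substituting $\hat G_{10}$, $\hat s_{10}$, $\hat A_{10}$ and $\hat c$ for $G_{10}$, $s_{10}$, $A_{10}$ and
$c$ in (18). If we note

\[ \delta_0 = c - \hat c + \left\langle \hat G_{10} + \tfrac14(\hat A_{10}^{*} + \hat A_{10})(\hat s_{10} - s_{10}),\, s_{10} - \hat s_{10} \right\rangle_{\mathbb{R}^p}, \tag{25} \]

($A^{*}$ is the transpose of a matrix $A$)

\[ \delta^{L} = \hat G_{10} - G_{10} + \tfrac12(\hat A_{10}^{*} + \hat A_{10})(\hat s_{10} - s_{10}) \tag{26} \]

and

\[ \delta^{Q} = \hat A_{10} - A_{10}, \tag{27} \]
we then get, by straightforward calculation:

\[ \forall x \in \mathbb{R}^p \quad \hat{\mathcal{L}}^{Q}_{10}(x) = \mathcal{L}^{Q}_{10}(x) + \delta_0 + \langle \delta^{L}, x - s_{10}\rangle_{\mathbb{R}^p} - \tfrac12\langle \delta^{Q}(x - s_{10}), x - s_{10}\rangle_{\mathbb{R}^p}. \tag{28} \]

Hence, our results are about quadratic perturbations of quadratic rules.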

The following corollary of Theorem 3.1 is easier to use. 

Corollary 3.1. Let $X = \mathbb{R}^p$ and $C$ be a symmetric positive definite matrix on
$\mathbb{R}^p$. Suppose that there exists $r > 0$ such that $\|\mathcal{L}_{10}\|_{L_2(\gamma_C)} \ge r$. Then, for $\hat V$
given by (21) and for all $0 < q < 1$ there exists $c_1(r,q) > 0$ such that:

nily) < ci(r, q) Q|lG(Aio - A.onlsim + II^'^'^^'IIk'' 

+2Sl + ^trace^{C{Aw~Aw)) 

where $\delta^{L}$ is given by (26) and $\delta_0$ by (25).
Proof. Let us recall that $\delta^{Q}$ is given by (27). We have

= ||i(5Q(a;) -E^^fe«(X)]) - {6\xW. - {So - ^E^c fe« (^)])llL(7c) 

< ^Var{qcif2sQc^/2iO) + Vari{C'^^S^ , Or-) + + 2E2^ [^cv^^qcv^ (0] 
(C lip.o, note that there is equality here) 

= IwC'/'SQC'/Ynsiu.) + WC'/'S^Wl + + '-trace' {C'/H^C'/'). 

□ 
3.3. Comparison of this result with those obtained for LDA.

The preceding theorem and its corollary are less powerful than those obtained
for the LDA procedure, and some conjectures might be made in parallel with
Theorem 2.1. In that theorem and in Theorem 2.2, both concerning linear rules,
we explained and quantified how parameter estimation errors are less important
when $\|F_{10}\|_{L_2(\gamma_C)}$ is large. This observation was based on the presence of a term
decreasing exponentially with $\|F_{10}\|_{L_2(\gamma_C)}$ in the quantities which determine the
upper bound on the learning error (and as a consequence the excess risk). In
Theorem 3.1, concerning the QDA procedure, we did not obtain that type of term.
Nevertheless, Remark 3.2 (more precisely the conditions it gives for the equivalence
of the measures) allows us to conjecture that such a term exists.

We also have to clarify the hypothesis under which the norm of $\mathcal{L}_{10}$ is lower
bounded. Let us recall that this hypothesis guarantees that the constant $c_1$ in
equation (22) is independent of the parameters of the problem. In parallel with
the results obtained for the LDA procedure, the lower bound that is required for
the norm of $\mathcal{L}_{10}$ corresponds to the assumption that the two groups considered
can always be distinguished. We believe that even if this hypothesis is natural,
it is deeply linked with the error measure that is used in our proof: the learning
error. Indeed, it is obvious that the excess risk is small when the data cannot
be distinguished (see Section 6 for a fuller discussion), but our result does not
reflect this fact.

We do not discuss the estimation of $G_{10}$, which leads to the same analysis as
that for $F_{10}$ in the case of a linear rule. Let us now discuss the estimation of
$W_{10}$ (and $W_{01}$).

3.4. Thresholding estimation of an operator and linearisation of a 
procedure. 

Recall that $W_{10}$ is a symmetric matrix. Suppose we know an orthonormal basis
in which it is diagonal. Let $\lambda_{10} = (\lambda_{10,i})_{i=1,\dots,p}$ be the vector of its eigenvalues.
To build the estimator $\hat W_{10}$ of $W_{10}$, we have to estimate its eigenvalues. It
remains to measure the learning error, and hence the estimation error of the
eigenvalue vector in $\ell^2$ norm. Suppose that $p$ tends to infinity. We will recall
later that if the measures of classes 0 and 1 tend to equivalent gaussian measures in
a separable Hilbert space, then $W_{10}$ tends to be Hilbert-Schmidt. This means
that $\lambda_{10}$ stays in $\ell^2(\mathbb{N})$. Once again, if $\lambda_{10}$ has coefficients decreasing sufficiently
fast, thresholding estimation should be used. This thresholding estimation
is no longer a reduction of the dimension of the space in which the rule acts,
but becomes a linearisation of the classification rule (it can be interpreted as
a reduction of the dimension of the space in which the rule lives). Indeed,
let $\hat W_{10} = \sum_{i=1}^{l} \hat\lambda_{10,i}\, e_i \otimes e_i$ for $l < p$ and $(e_i)_{i=1,\dots,p}$ be an orthonormal basis of
$\mathbb{R}^p$, we have:

Figure 1. Separation of the data in a direction where the variances are different. The two
groups can be identified with their ellipsoids of concentration: a horizontal ellipsoid and a
vertical ellipsoid. The two groups have the same mean but different covariances, which makes
the data quite well separated. One can take advantage of this separation only if a quadratic
rule is used.

where $g(x)$ is affine and defined on $\mathbb{R}^p$. In this case, the plug-in rule is affine
in a subspace of dimension $p - l$ and quadratic in the subspace of dimension $l$
spanned by $(e_i)_{i=1,\dots,l}$.

Let us note that because $W_{10} = I - C_1^{1/2} C_0^{-1} C_1^{1/2}$, setting the eigenvalues
of $W_{10}$ to zero in a subspace of $\mathbb{R}^p$ is equivalent to choosing a subspace in which
the covariance matrices $C_1$ and $C_0$ are "close enough". In this subspace, one can
suppose that $C_1$ equals $C_0$. The classification rule, in this subspace, is linear.
Figure 1 illustrates the case where the eigenvalues of $W_{10}$ are big enough and
why a quadratic rule is better in that case.



4. Classification procedure in high dimension: a way to solve 
Problem 2 

4.1. Introduction.

In this section, we give a practical method of classification for gaussian data in
high dimension and hence present our contribution to Problem 2. Note that although we
only treat the binary classification problem, it is easy to extend our procedure
to the case of $K$ classes as we have done in [\ ^>]. Recall that we are given $n_1$
observations from $P_1$ and $n_0$ observations from $P_0$. We will note $n = n_1 + n_0$.
We suppose that each of the $n_k$ vectors of group $k$ is composed of the $p$ first
wavelet coefficients (see [20]) of a random curve from $X = L_2[0,1]$ which is a
realisation of a gaussian random variable with distribution $P_k = \gamma_{C_k,\mu_k}$ of unknown mean and
covariance.

Recall that a learning rule can be defined by a partition of $\mathbb{R}^p$. We construct
this partition $\{\hat V, \mathbb{R}^p \setminus \hat V\}$ of $\mathbb{R}^p$ with the use of a frontier function $\hat{\mathcal{L}}_{10}$:

\[ \hat V = \{x \in \mathbb{R}^p : \hat{\mathcal{L}}_{10}(x) > 0\}, \tag{29} \]

which will be given in the sequel.

We divide the presentation into two parts. In the first part, we give a
theoretical result in the case where the covariance matrices are supposed to be
known. In the second part, we give the method that is used when the covariances
are unknown. We keep the notation of the preceding sections. In the case of the LDA
procedure, $F_{10} = C^{-1}m_{10}$ and $s_{10} = \frac{\mu_1 + \mu_0}{2}$, and in the case of the
QDA procedure, $G_{10} = \frac12(C_1^{-1} + C_0^{-1})m_{10}$ and $A_{10} = C_1^{-1} - C_0^{-1}$.

4.2. Case of known and equal covariance: procedure and theoretical
result.

Notation and assumptions. Let $\hat\mu_k$ be the empirical mean of the learning
data $(X_{ik})_{i=1,\dots,n_k}$ of class $k$. We suppose here that the covariances of groups 0
and 1 both equal $C$, and that $s_{10}$ is known. The separation frontier between the
two groups is affine and $F_{10}$ is the only unknown parameter. We suppose that
the learning set is made of $n_1 = n_0 = n(p)/2$ $p$-dimensional vectors. We give a
method to construct an estimator of $F_{10}$ and give theoretical results when $n(p)$
tends to infinity much more slowly than $p$.

For g > 0, the ball is composed of the vectors 9 eM.p such that 

i=l 

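We will note

\[ \Omega_p(\Theta(R), r) = \left\{(x,y,C) \in \mathbb{R}^p \times \mathbb{R}^p \times \mathcal{C}_p \ \text{such that}\ C^{-1/2}(x-y) \in \Theta(R) \ \text{and}\ \|C^{-1/2}(x-y)\|_{\mathbb{R}^p} \ge r\right\} \tag{30} \]

where $\mathcal{C}_p$ is the set of symmetric positive definite matrices in $\mathbb{R}^p$. If $(\mu_0,\mu_1,C) \in
\Omega_p(\Theta(R),r)$, we will note

\[ \mathcal{D}_p(\hat{\mathcal{L}}_{10}) = \mathcal{C}(1_{\hat V}) - \mathcal{C}(1_{V}), \tag{31} \]
where $\hat V$ is given by (29) and $V$ is given by (2).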

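The Procedure. The plug-in rule assigns the observation $X$ to class 1 if it
belongs to $\hat V$ defined by (29), where

\[ \hat{\mathcal{L}}_{10}(x) = \langle \hat F_{10}, x - s_{10}\rangle_{\mathbb{R}^p}. \]

We estimate $F_{10} = C^{-1}m_{10}$ by $\hat F_{10} = C^{-1}\hat m_{10}$, where the coefficients of $C^{-1/2}\hat m_{10}$
are given by

\[ \left(y_{10,i}\, 1_{|y_{10,i}| \ge \lambda^{FDR}_{10}}\right)_{i=1,\dots,p} \quad\text{where}\quad y_{10,i} = \left(C^{-1/2}(\hat\mu_1 - \hat\mu_0)\right)_{i}, \]

and $\lambda^{FDR}_{10}$ is chosen by the Benjamini and Hochberg procedure [4] for the
control of the false discovery rate (FDR) of the following multiple hypotheses:

\[ \forall i = 1,\dots,p \quad H_{0i} : E[y_{10,i}] = 0 \quad\text{versus}\quad H_{1i} : E[y_{10,i}] \neq 0. \tag{32} \]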

We recall that this procedure is the following. The $(|y_{10,i}|)_i$ are ordered in
decreasing order:

|yio(i)| > ■ • • > \yw{p) \ and Xf^f^ |yio(fcf„««)l 
where fcfg^^ = max Ik e {l,...,p} : |2/io(fc)| > 



1 / bpk 



n{p) \ 2p 

z{a) is the quantile of order a of a standardized gaussian random variable and 
hp G [0, l/2[ is lower bounded by where cq is a positive constant (which 
does not depend on p. 
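
The thresholding step above can be sketched as follows. This is a minimal Python illustration under assumptions that are ours: the coordinates $y_{10,i}$ are taken to have a known common standard deviation `sigma` (in the setting above one would expect $2/\sqrt{n(p)}$), $z(\cdot)$ is read as an upper quantile of the standard gaussian distribution, and the names `fdr_threshold` and `hard_threshold` are hypothetical.

```python
import math
from statistics import NormalDist

def fdr_threshold(y, sigma, b):
    """Step-up (Benjamini-Hochberg type) threshold for noisy coefficients y.

    Assumes 0 < b < 1 and that each y_i has standard deviation sigma.
    Returns |y|_(k) for the largest k such that |y|_(k) >= sigma * z(b*k/(2p)),
    where z(a) is the upper a-quantile of N(0, 1); returns +inf if no k works.
    """
    p = len(y)
    abs_sorted = sorted((abs(v) for v in y), reverse=True)
    std_normal = NormalDist()
    k_fdr = 0
    for k in range(1, p + 1):
        t_k = sigma * std_normal.inv_cdf(1 - b * k / (2 * p))
        if abs_sorted[k - 1] >= t_k:
            k_fdr = k  # keep the largest index that passes the test
    return abs_sorted[k_fdr - 1] if k_fdr > 0 else math.inf

def hard_threshold(y, lam):
    """Keep the coordinates with |y_i| >= lam and set the others to zero."""
    return [v if abs(v) >= lam else 0.0 for v in y]

# Toy example: two large coordinates and six small ones.
y = [5.0, -4.0, 0.1, -0.2, 0.05, 0.3, 0.0, -0.1]
lam = fdr_threshold(y, sigma=1.0, b=0.1)
print(lam, hard_threshold(y, lam))
```

With this reading, the estimator keeps only the coordinates declared significant by the multiple testing procedure, which is the dimension reduction discussed in this section.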
Theoretical result



Theorem 4.1. Let R > 0, and q e]0, 2[. Let V be defined by (29) and rjp = 

p^^ R^/n{p). Suppose that p tends to infinity. If 
then, for r > 0, we have 



sup Ep®n 



< 



l+Op(l) 



V2- 



Rn^/^{p) 



where 'Dp is the excess risk as defined by (31 ), and P^" is the law of the 
learning set. 

Proof. The covariance matrix of the vector $C^{-1/2}(\hat\mu_1 - \hat\mu_0)$ equals $\frac{4}{n(p)} I_p$. We
then have to use successively Theorem 2.1 (of this article), Theorem 1.1 of
Abramovich et al. [1], and Theorem 5 point 35. of Donoho and Johnstone [11]
to be able to write, $\forall r > 0$:



sup iUpe 

(/io,Pl,C)Gnp(;9(_R),r) 



< 



l+Op(l) 



V2 



log 



1/2 



Rin{p)i/'^ 



Rn^l^{p) 



2~q 



This inequality leads to the result by the use of the Jensen inequality: 

1/2 



Vp{Cw) 



< 



□ 



Comments. Let us make a few remarks on this result.

1. The rate of convergence is faster when $q$ is close to 0, and slower when
it is close to 2. This leads us to consider the sparsity of $C^{-1/2}(\mu_0 - \mu_1)$,
and makes the use of the wavelet basis attractive. On the one hand, it
transforms a wide class of curves into sparse vectors and, on the other
hand, it almost diagonalises a wide class of covariance operators.

2. We could obtain the same speed with a universal threshold (i.e with the 
threshold Xjj = ^^^-^2 log(p)). In this case, the constant ''""^"S^^'' would 
not be that good (cf [1]). 

3. We are not aware of any results concerning the convergence of any
classification procedure in this framework (the high dimensional gaussian
framework with the set of possible parameters determined by $\Omega_p$). Indeed, we do
not make any strong assumption on $C$. Bickel and Levina [6] as well as Fan
and Fan [12] suppose in their work that the ratio between the highest and
the lowest eigenvalue is lower and upper bounded. Even if our theorem
does not treat the case where $C$ is unknown, the hypotheses we use seem
more natural. Let us recall that if $X$ is a gaussian random variable with
values in a Hilbert space, then the covariance operator is necessarily
nuclear. Hence, the assumption used by the above mentioned authors does not
allow us to consider gaussian measures with support in a Hilbert space.
4. Finding the significant components of the normal vector $F_{10}$ defining the
optimal separating hyperplane is equivalent to finding the significant
contrasts in a multivariate ANOVA. Hence, controlling the expected false
discovery rate in this ANOVA is sufficient to get a good classification rule.



4.3. The case of different unknown covariances

For the rest of this section, if $k \in \{0,1\}$, $\hat\mu_k$ will be the empirical mean of the
learning data of class $k$. We are going to use a diagonal estimator $\hat C_k$ of the
covariance matrix $C_k$. The diagonal elements of $\hat C_k$ will be $(\hat\sigma^2_{kq})_{q=1,\dots,p}$. For
$q \in \{1,\dots,p\}$, $k \in \{0,1\}$, $\hat\sigma^2_{kq}$ will be the unbiased version of the empirical
variance of feature $q$ of the observations $(X_{ikq})_{i=1,\dots,n_k}$ of class $k$. We will note

\[ \hat s_{10} = (\hat\mu_1 + \hat\mu_0)/2. \]
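
These plug-in quantities are simple to compute. The following Python sketch uses only the standard library; the function names `diagonal_estimates` and `midpoint` and the toy data are illustrative assumptions, not part of the procedure's original description.

```python
from statistics import mean, variance

def diagonal_estimates(samples):
    """Per-feature empirical mean and unbiased variance of one class.

    `samples` is a list of n_k observations, each a list of p features.
    `statistics.variance` divides by n_k - 1, i.e. it is the unbiased version.
    """
    columns = list(zip(*samples))
    mu_hat = [mean(col) for col in columns]
    var_hat = [variance(col) for col in columns]
    return mu_hat, var_hat

def midpoint(mu1_hat, mu0_hat):
    """Estimated centre s_10 = (mu_1 + mu_0) / 2."""
    return [(a + b) / 2 for a, b in zip(mu1_hat, mu0_hat)]

# Toy learning set with two features per observation.
class1 = [[1.0, 2.0], [3.0, 6.0]]
class0 = [[0.0, 0.0], [0.0, 1.0], [0.0, 2.0]]
mu1_hat, var1_hat = diagonal_estimates(class1)
mu0_hat, var0_hat = diagonal_estimates(class0)
print(mu1_hat, var1_hat, midpoint(mu1_hat, mu0_hat))
```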

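The classification rule used decides that $X \in \mathbb{R}^p$ comes from the class $k$ if $X$
belongs to $\hat V_k$ given by (29) and

\[ \hat{\mathcal{L}}_{10}(x) = -\tfrac12\langle \hat A_{10}(x - \hat s_{10}), x - \hat s_{10}\rangle_{\mathbb{R}^p} + \langle \hat G_{10}, x - \hat s_{10}\rangle_{\mathbb{R}^p} - \hat c_{10}, \]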

where the quantities of this equation will be given in what follows, for all 
(1,0) G {!,..., i^}^, 1 7^ 0, we now give Gio (equation (33)), Aiq (equation 
34), and cio (equation 35). 

We estimate do = ^{Ci^ + C'(^"^)™io by 

1/2 



where ywg = ^ + 

V2 \<^lq <^0q 




(33) 



and $\lambda^{FDR}_{10}$ is chosen by the Benjamini and Hochberg procedure. This procedure
is the following. Let $\mathrm{Var}_0(y_{10q})$ be the variance of $y_{10q}$ calculated under the
hypothesis that $\mu_{1q} = \mu_{0q}$. The term



2ni 2no 

'^Iq 



is an estimation of this variance when the $\sigma^2_{kq}$ ($k = 0, 1$) are known and equal to
$\hat\sigma^2_{kq}$. In practice, we substitute these terms for $\mathrm{Var}_0(y_{10q})$. The real numbers

\[ \left(|y_{10q}| / \sqrt{\mathrm{Var}_0(y_{10q})}\right)_{q=1,\dots,p} \]
are ordered in decreasing order:

l2/io(i)l/y^^a?'o(yio(i)) > ■ • • > \yw(p)/\JVara{yw{p))\ and Afo^^^ = |?/io(fcfo«)| 
where 



,FDR J , I I ^ ^ + ^l{k)l^l(k) , ^ + ^l(k)/^\k) fbpk 

fcio" = nmx fc : |y,o(.)l>\/ + z [-j^ 



$z(\alpha)$ is the quantile of order $\alpha$ of a standardized gaussian random variable and
$b_p \in [0,1[$ is as in the preceding algorithm.

In practice, we choose $b_p = 0.01$, but one could keep a part of the learning set
to learn the best value of $b_p$. Note that in the applications we have in mind, the
learning set is too small to be divided. In addition, the choice of $b_p$, in view of
Theorem 4.1, does not determine the performance of the algorithm. In practice
the difference in classification error between the choices $b_p = 0.01$ and $b_p = 0.05$,
for example, is not important.

This first part of the method constitutes a dimension reduction. Indeed, the
only coordinates of $(\hat G_{10q})_{q=1,\dots,p}$ that are kept non-null are those for which
$|y_{10q}| \ge \lambda^{FDR}_{10}$. The linear map associated with $(\hat G_{10q})_{q=1,\dots,p}$ only acts
in $k^{FDR}_{10}$ directions. Let us also note that if we extend our procedure to a
multiclass procedure, for two couples of classes $(i,j) \neq (l,m)$, the corresponding
estimations $\hat G_{ij}$ and $\hat G_{lm}$ might be based on different dimension reductions.



Remark 4.1. The testing procedure used can be analysed as a "vertical" ANOVA
that reveals the interesting directions

1. in which classification should be done (with thresholding estimation of $G_{10}$);

2. in which classification should be quadratic (with thresholding estimation of
$A_{10}$).

The matrix Aiq is estimated by a diagonal matrix with diagonal elements 
given by 

(34) 

and the threshold ^]i(f^ is chosen with the same type of procedure as the one 
used to find Xf^f^ . Let Varo{wioq) be the variance of wioq under the hypothesis 

that CTin = (Tn„. The term — ^- H ^ is an estimation of it that we use in 

practice. The real numbers {\wiQq/ ^yVaro{wlQq)\)q are ordered by decreasing 
order: 

\ww{i)/^JVaro{wiop)\ >■■■> \ww^p-,/ ^Jvaro{wlOp)\ and t][„^^ = kio(fcf(,"«)l 
This part of the method constitutes a linearisation of the rule. Indeed, the
directions $q \in \{1,\dots,p\}$ in which $\hat a_{10q}$ is 0 are the directions in which the
classification rule between the groups 1 and 0 is linear. In the other directions, the
rule is quadratic.

The use of this method is still motivated by Theorem 4.1 and the theorems
used in its proof, but it needs additional theoretical justification.

We will finally note: 
cio ^Xl^l^-io^lS:')!™ (30109(^19-/^09)^+ 2 log I det((T(^/ (Tig)]) . (35) 

q—l ^ ^ 

5. Application to medical data and the TIMIT database 

We now study the performance of the given procedure. To that end,
we compare our method with the one given by Rossi and Villa [22] on the
TIMIT database. We then test our procedure on medical data.

5.1. Comparison of our method with the one of Rossi and Villa in 
the case of two class classification 

Rossi and Villa use a support vector machine (SVM) with different types of
kernels. Recall that the SVM procedure is to construct an affine frontier function
$f$ given by

\[ f(x) = \langle w, x\rangle_{\mathbb{R}^p} + b, \]
where $w$ and $b$ are solutions of an optimization problem of the following type:

\[ \min_{w,b,\xi}\ \frac12\|w\|^2_{\mathbb{R}^p} + C\sum_{i=1}^{n}\xi_i \]

\[ \text{under}\quad y_i\left(\langle w, x_i\rangle_{\mathbb{R}^p} + b\right) \ge 1 - \xi_i, \quad \xi_i \ge 0, \quad i = 1,\dots,n, \]
where $(x_i, y_i)_{i=1,\dots,n}$ are the couples (observations, labels) of the learning set.

The TIMIT database has notably been studied by Hastie et al. [18]. This
database includes the phonemes "aa" and "ao" pronounced by many different
persons. The corresponding records are curves observed at a fine enough sampling
frequency. More precisely, one curve is a $p$-dimensional vector with $p = 256$.
The learning set is composed of 519 "aa" and 759 "ao", and the test set is
composed of 176 "aa" and 263 "ao". The curves $(x_i)_{i=1,\dots,519}$ are those
which correspond to the pronunciation of phoneme "aa", and the label $y_i = 0$
is associated with them. The label "1" is associated with the other curves, which
correspond to the pronunciation of phoneme "ao". The method of Rossi and
Villa gives almost the same results as ours: 20% of classification mistakes.




5.2. Application to medical data 

The medical problem is the following. In magnetic resonance imaging, one can
obtain spectra characterizing tissues localized in some area of the brain. The
spectra obtained can be used to characterize tumors. Unfortunately, even for
a specialist, it is hard to define a good rule to associate the name of a tumor
with a given spectrum. Some spectra have been obtained on identified tumors. We
have been given these spectra. In order to have enough spectra in our learning
set, we retained five groups of spectra (some of them regrouping many tumors):
the glioblastomas of the first type^, the glioblastomas of the second type, the
meningiomas, the metastases and the healthy tissues. The database provided by
the specialists contains 21 glioblastomas of first type, 9 glioblastomas of second
type, 16 meningiomas, 18 metastases and 9 healthy tissues, that is, 75 spectra
sampled at 1024 points. We give the plot of the spectra considered in Figure
2. In order to test our procedure, we used a strategy of type "leave-one-out".
Figure 4 leads us to an experimental confirmation that in the case of two class
classification, the chosen dimension is a good one.

We tested different configurations summarized in the table of Figure 3. The
classification error rate is still significant, but the dimension reduction procedure
provides a reduction of the error rate (recall that in the case of 4 groups having
equal a priori probability, a rule that would guess randomly the type of tumor
would have an error rate of 75%). There are two reasons for this moderate
performance.

Roughly, theoretical physics predicts that a spectrum associated with a given
tumor, for example a glioblastoma, is a random variable $y = (y_q)_{q=1,\dots,p}$ that
has quite a small variability. Hence, we should be able to separate easily spectra
associated with different groups. Unfortunately, in practice, the instrumentation
leads to a measurement of spectra $z = (z_q)_{q=1,\dots,p}$ having complex values and
for which there exists a sequence of angles $(\psi_q)_{q=1,\dots,p}$ such that:

\[ \forall q \in \{1,\dots,p\} \quad y_q = \Re\left(e^{i\psi_q} z_q\right). \]

This sequence of angles is unknown. The theoretical physics of instrumentation
shows that there are two reals $(a,b)$ such that

\[ \forall q \in \{1,\dots,p\} \quad \psi_q = aq + b. \]
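
To make the role of the phase concrete, here is a small Python sketch. It assumes the relation $y_q = \Re(e^{i\psi_q} z_q)$ with the linear phase $\psi_q = aq + b$, and it assumes that $a$ and $b$ are known, which is precisely what is not available in practice; the function name `real_spectrum` and the toy values are hypothetical.

```python
import cmath

def real_spectrum(z, a, b):
    """Return y_q = Re(exp(i * (a*q + b)) * z_q) for q = 1, ..., p."""
    return [(cmath.exp(1j * (a * q + b)) * z_q).real
            for q, z_q in enumerate(z, start=1)]

# Toy example: a real spectrum is dephased, then recovered with the true (a, b).
y_true = [1.0, 2.0, 0.5]
a, b = 0.3, -1.0
z = [y * cmath.exp(-1j * (a * q + b)) for q, y in enumerate(y_true, start=1)]
print(real_spectrum(z, a, b))        # close to y_true
print(real_spectrum(z, 0.25, -1.0))  # a wrong slope distorts the spectrum
```

The second call illustrates how a residual error on the phase parameters creates a spurious variability between spectra of the same group.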

Methods to obtain $a$ and $b$ are not sufficiently efficient, but this represents
an active field of research. We chose to ask the physicians to change the phase
manually in order to have a homogeneous real part of the spectra in a particular
group, and we kept the real part of the spectra. The change of phase made by
the physicians is not optimal and the residual variation of the phase creates a
certain disparity of observed spectra inside each group. This disparity can be
seen in Figure 2. The incorporation of the phase into a classification algorithm,

^The group of glioblastomas has too large a variability, so we chose to divide it into
two groups: first type and second type. These two types correspond to the presence of certain
chemical substances.
Figure 2. Spectra of the learning set: (a) 21 glioblastomas A, (b) 9 glioblastomas B, (c) 16 meningiomas, (d) 18 metastases, (e) 9 healthy tissues.



Groups considered    all     all except metastases    glioblastomas of first type and meningiomas
Error rate           43 %    30 %                     5 %

Figure 3. Considered groups and error rate in each case.
Figure 4. Classification error rate (in a two group problem: meningiomas versus
glioblastomas of first type) as a function of the selected dimension. The dimension selected by our
algorithm is marked by a black point in the figure.



and the use of the complex nature of the data will be the object of further
studies. We note, however, that these phase problems in the Fourier domain can
be translated interestingly into the temporal domain.

Finally, the learning set is still too small. We hope to see its size increase in
the forthcoming years.

6. A more geometric alternative measure of error: the learning error 
6.1. Definition and main result 

We have already defined the learning error to be

\[ \mathcal{R}(g) = P(g(X) \neq Y \text{ and } g^*(X) = Y), \]
which, when $Y \sim \mathcal{U}(\{0,1\})$, equals

\[ \mathcal{R}(g) = \tfrac12\left(P_1(g(X) \neq 1 \text{ and } g^*(X) = 1) + P_0(g(X) \neq 0 \text{ and } g^*(X) = 0)\right). \]

In other words, the learning error is the probability of misclassifying $X$ with $g$
while classifying it correctly with $g^*$. The point that motivates the use of this error is
that

1. it leads to a simple geometric interpretation (mostly used in the two follow- 
ing Sections) and hence it is used in all the further theoretical development 
we will give; 

2. it is not sensitive to the possible indistinguishability of the distributions 
Po and Pi and it leads to lower bounds as in Section 2 (see remark below). 
It follows easily from

\[ \mathcal{C}(g) - \mathcal{C}(g^*) = P(g(X) \neq Y \text{ and } g^*(X) = Y) - P(g(X) = Y \text{ and } g^*(X) \neq Y), \]

that a classification rule $g$ satisfies:

\[ \mathcal{C}(g) - \mathcal{C}(g^*) \le \mathcal{R}(g). \tag{36} \]

In the gaussian case that is studied in this article, we proved the following
theorem, which gives a reverse inequality of (36).

Theorem 6.1. Let $g^*$ be the optimal rule in the binary classification problem
(as presented in Section 1).

1. If $P_0$ and $P_1$ have the same covariance $C$ and respective means $\mu_0$ and $\mu_1$,
then, for all measurable functions $g : \mathbb{R}^p \to \{0,1\}$, we have:

C{g) - C(,*) > min|^||C-V^rn,o|k.e '"^"^"'-- 7^(g)^^| , 

where $m_{10} = \mu_1 - \mu_0$.

2. Let $c_1 > 0$ and $\mathcal{P}(c_1)$ be the set of couples $(P,Q)$ of gaussian measures
on $\mathbb{R}^p$ such that $d_1(P,Q) \ge c_1$. If $(P_1,P_0) \in \mathcal{P}(c_1)$, then there exists a
constant $c(c_1) > 0$ (that only depends on $c_1$) such that

C{g)~C{g*) > min |c(cl)7^(g)^ ^| . 

Before we prove this result, let us comment on it.

Comments. Let us note that

\[ \mathcal{C}(g) - \mathcal{C}(g^*) \le \tfrac12 d_1(P_1,P_0). \]

Hence, in the case where $d_1(P_1,P_0)$ tends to 0, the excess risk does not measure
the difference between $g$ and $g^*$ but the proximity of $P_1$ and $P_0$. The learning
error is not sensitive to this scale phenomenon, as the following example shows.

Example 6.1. Let $\mu > 0$, $P_1 = \mathcal{N}(\mu,1)$ and $P_0 = \mathcal{N}(-\mu,1)$. In this case, for
all $a \in \mathbb{R}$,

\[ \mathcal{R}(1_{[a,\infty[}) = \tfrac12\left(P(0 \le \xi + \mu < a) + P(a \le \xi - \mu < 0)\right), \]

where $\xi \sim \mathcal{N}(0,1)$; and $d_1(P_1,P_0) \to 0$ if and only if $\mu \to 0$, in which case

\[ \mathcal{R}(1_{[a,\infty[}) \to \tfrac12 P(\xi \in [0,|a|]). \]

Under these conditions, the learning error associated with $1_{[a,\infty[}$ tends to 0
only if $a$ tends to 0. In other words, when $\mu \to 0$, the learning error makes a
difference between the rules $1_{[100,\infty[}$ and $g^* = 1_{[0,\infty[}$:

\[ \inf_{\mu < 50} \mathcal{R}(1_{[100,\infty[}) \ge \tfrac12 P(\xi \in [0,50]) \approx \tfrac14 \]

while we have

C(]l[ioo,oc[)-C(5*) < ^di(Pi,Po) < 

Remark 6.1. By definition, the excess risk is the quantity of interest. The problem with it is
that it can give credit to every given procedure when $d_1(P_1,P_0)$ is sufficiently
small. Hence, one cannot argue that a rule is never good according to the excess
risk. In the preceding example, the procedure $g(x) = 1_{[100,\infty[}(x)$ is uniformly (on,
say, $\mu < 50$) inconsistent according to the learning error but not according to
the excess risk.
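
The phenomenon described in Example 6.1 is easy to observe numerically. The Python sketch below evaluates the learning error and the excess risk of the rule $1_{[a,\infty[}$ in closed form, for $P_1 = \mathcal{N}(\mu,1)$, $P_0 = \mathcal{N}(-\mu,1)$ and equal weights on the two classes; the function names are ours and only serve the illustration.

```python
from statistics import NormalDist

PHI = NormalDist().cdf  # cumulative distribution function of N(0, 1)

def learning_error(a, mu):
    """Learning error of 1_[a,inf[ when P1 = N(mu, 1) and P0 = N(-mu, 1)."""
    on_class_1 = max(PHI(a - mu) - PHI(-mu), 0.0)  # P(0 <= xi + mu < a)
    on_class_0 = max(PHI(mu) - PHI(a + mu), 0.0)   # P(a <= xi - mu < 0)
    return 0.5 * (on_class_1 + on_class_0)

def excess_risk(a, mu):
    """C(1_[a,inf[) - C(g*), with g* = 1_[0,inf[ and weight 1/2 on each class."""
    risk = 0.5 * PHI(a - mu) + 0.5 * (1.0 - PHI(a + mu))
    optimal_risk = PHI(-mu)
    return risk - optimal_risk

# As mu decreases, the excess risk of 1_[100,inf[ vanishes
# while its learning error stays close to 1/4.
for mu in (1.0, 0.1, 0.01):
    print(mu, learning_error(100.0, mu), excess_risk(100.0, mu))
```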

The main consequence of this theorem has already been used in Section 2.2.
From equation (36), if $(g_n)_{n \ge 0}$ is a sequence of classification rules such that
$\mathcal{R}(g_n)$ tends to zero, then $\mathcal{C}(g_n) - \mathcal{C}(g^*)$ tends to zero. Theorem 6.1 implies the
converse result.



6.2. Proof of Theorem 6.1 

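Proof. Let us take

\[ K_1 = \{x \in \mathbb{R}^p : g(x) \neq 1 \text{ and } g^*(x) = 1\} \]

and

\[ K_0 = \{x \in \mathbb{R}^p : g(x) \neq 0 \text{ and } g^*(x) = 0\}. \]

Then $\mathcal{R}(g) = \tfrac12\left(P_1(K_1) + P_0(K_0)\right)$ and at least one of the following two
inequalities is satisfied (from the pigeonhole principle):

\[ P_1(K_1) \ge \mathcal{R}(g), \qquad P_0(K_0) \ge \mathcal{R}(g). \]

Without loss of generality we will suppose that $P_1(K_1) \ge \mathcal{R}(g)$, which implies
$P_1(K_1) + P_0(K_1) \ge \mathcal{R}(g)$. Note that we have

\begin{align*}
\mathcal{C}(g) - \mathcal{C}(g^*) &= P(g \neq Y) - P(g^* \neq Y) \\
&= \tfrac12\left(P_1(K_1) - P_1(K_0)\right) + \tfrac12\left(P_0(K_0) - P_0(K_1)\right) \quad \text{(by conditioning with respect to $Y$)} \\
&= \tfrac12\left((P_1 - P_0)(K_1) + (P_0 - P_1)(K_0)\right),
\end{align*}
and, because $g^*(X) = 1$ if and only if $dP_1 \ge dP_0$ (by definition of $g^*$ and from
the fact that $Y \sim \mathcal{U}(\{0,1\})$), we get

\[ \mathcal{C}(g) - \mathcal{C}(g^*) = \tfrac12\int_{K_1 \cup K_0} |dP_1 - dP_0| \ge \tfrac12\int_{K_1} |dP_1 - dP_0|. \tag{37} \]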

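A straightforward calculation (see for example [15] Proposition 1.4.2 Chapter 1
Part I) leads to

\[ \int m(x)\,(dP_1 - dP_0)(x) = 2E_P\left[m(X)\, e^{h_0(P,X)} \sinh\left(\tfrac12\mathcal{L}_{10}(X)\right)\right] \]

for all measurable $m$, where $P$ is any probability measure that dominates $P_1$
and $P_0$, $h_0(P,x) = \tfrac12\log\left(\frac{dP_1}{dP}(x)\frac{dP_0}{dP}(x)\right)$ and $\mathcal{L}_{10}(x) = \log\left(\frac{dP_1}{dP_0}(x)\right)$. In particular

\[ d_1(P_1,P_0) = 2E_P\left[e^{h_0(P,X)}\left|\sinh\left(\tfrac12\mathcal{L}_{10}(X)\right)\right|\right]. \]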

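Also note that whenever $K \subset \{x \in \mathbb{R}^p : \mathcal{L}_{10}(x) \ge 0\}$ we have

\[ P_1(K) - P_0(K) = 2E_P\left[1_K(X)\, e^{h_0(P,X)} \sinh(\mathcal{L}_{10}(X)/2)\right], \]

and as a consequence, (37) can be rewritten

\[ \mathcal{C}(g) - \mathcal{C}(g^*) \ge E_P\left[1_{K_1}(X)\, e^{h_0(P,X)} \sinh(\mathcal{L}_{10}(X)/2)\right]. \tag{38} \]

It can also be shown that

\[ P_1(K) + P_0(K) = 2E_P\left[1_K(X)\, e^{h_0(P,X)} \cosh(\mathcal{L}_{10}(X)/2)\right], \]

and consequently, $P_1(K_1) + P_0(K_1) \ge \mathcal{R}(g)$ is rewritten

\[ 2E_P\left[1_{K_1}(X)\, e^{h_0(P,X)} \cosh(\mathcal{L}_{10}(X)/2)\right] \ge \mathcal{R}(g). \tag{39} \]

On the other hand, $d_1(P_1,P_0) \ge c_1$ leads to:

\[ 2E_P\left[e^{h_0(P,X)}\, |\sinh(\mathcal{L}_{10}(X)/2)|\right] \ge c_1. \tag{40} \]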

In the rest of the proof, we shall combine (39) and (40) in order to lower
bound the right-hand side of (38). We remark that the left-hand side of (39) and
the right-hand side of (38) only differ by a factor of two and by the replacement of a sinh by a
cosh. For our purpose, these two functions only differ fundamentally near zero.
We are therefore going to decompose $K_1$ into two disjoint sets. We define

\[ K_1^{+} = \{x \in K_1 : \mathcal{L}_{10}(x) \ge 2\} \quad\text{and}\quad K_1^{-} = \{x \in K_1 : \mathcal{L}_{10}(x) < 2\}. \]

Let us also define $A$ and $B$ by:

\[ \int_{K_1} e^{h_0(P,x)} \sinh(\mathcal{L}_{10}(x)/2)\, P(dx) = \underbrace{\int_{K_1^{+}} e^{h_0(P,x)} \sinh(\mathcal{L}_{10}(x)/2)\, P(dx)}_{A} + \underbrace{\int_{K_1^{-}} e^{h_0(P,x)} \sinh(\mathcal{L}_{10}(x)/2)\, P(dx)}_{B}. \]

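From (39) (and the pigeonhole principle) two cases can occur. In the first case

\[ E_P\left[1_{K_1^{+}}(X)\, e^{h_0(P,X)} \cosh(\mathcal{L}_{10}(X)/2)\right] \ge \mathcal{R}(g)/4, \]

and in the second

\[ E_P\left[1_{K_1^{-}}(X)\, e^{h_0(P,X)} \cosh(\mathcal{L}_{10}(X)/2)\right] \ge \mathcal{R}(g)/4. \tag{41} \]

In the first case, because $X \in K_1^{+}$ implies

\[ \sinh(\mathcal{L}_{10}(X)/2) \ge \tfrac12\cosh(\mathcal{L}_{10}(X)/2) \qquad (\ln(3) < 2), \]

we have $A \ge \mathcal{R}(g)/8$ and hence the desired result (it suffices to remark that
$\mathcal{L}_{10}(x) \ge 0$ if $x \in K_1$, which implies $B \ge 0$).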

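We shall now consider the case where (41) is satisfied. In this case, because
$\cosh(x) \le 2$ for all $|x| \le 1$, we have

\[ \int_{K_1^{-}} e^{h_0(P,x)}\, P(dx) \ge \mathcal{R}(g)/8. \]

Also, the definition

\[ \frac{d\nu}{dP}(x) = \frac{e^{h_0(P,x)}}{\int e^{h_0(P,y)}\, P(dy)} \]

makes $\nu$ a probability measure on $\mathbb{R}^p$ and

\[ \nu(K_1^{-}) \ge \mathcal{R}(g)/8. \tag{42} \]
On the other hand (see the definition of $h_0$),

\[ \int e^{h_0(P,\cdot)}\, dP = \int \sqrt{dP_1\, dP_0} = A_2(P_1,P_0) \]

($A_2(P_1,P_0)$ is the Hellinger affinity between $P_1$ and $P_0$), which leads to

\[ B = A_2(P_1,P_0) \int_0^{\infty} \nu\left(X \in K_1^{-} \text{ and } |\sinh(\mathcal{L}_{10}(X)/2)| \ge t\right) dt. \tag{43} \]

We have

\[ \nu(X \in K_1^{-}) = \nu\left(X \in K_1^{-} \text{ and } |\sinh(\mathcal{L}_{10}/2)| < t\right) + \nu\left(X \in K_1^{-} \text{ and } |\sinh(\mathcal{L}_{10}/2)| \ge t\right). \]
Let $g$ be the map which associates to $t \ge 0$ the real

\[ g(t) = \sup_{(P_1,P_0) \in \mathcal{P}(c_1)} \nu\left(|\sinh(\mathcal{L}_{10}(X)/2)| < t\right). \tag{44} \]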
For every $t \ge 0$, we have:

\[ \nu\left(X \in K_1^{-} \text{ and } |\sinh(\mathcal{L}_{10}/2)| \ge t\right) = \nu(X \in K_1^{-}) - \nu\left(X \in K_1^{-} \text{ and } |\sinh(\mathcal{L}_{10}/2)| < t\right). \]

We then deduce from this equality and from (43) that for all $\epsilon > 0$,
B>A2{Pi,Pn)( y{XcK^ and | sinh(/:io(X)/2)| > 

> ev{X e ) - yl2(Pi, Po) / v {X iff and | sinh(£io/2)| < t) dt) 

Jo 

> en{g)/8 - [ u{X eK~ and | smh{Cw/2)\ < t) dtA2{Pi,Pa) 

Jo 

where this last inequality results from (42). The rest of the proof relies on the
following lemma.

Lemma 6.1. 1. The application g defined by (44) leads to 

g^t) < -P^t'^' 

' - A2{PuPo) 

(c(ci) is a positive constant that only depends on ci). 
2. In the case where $C_1 = C_0 = C$, we have

\[ \nu\left(X \in K_1^{-} \text{ and } |\sinh(\mathcal{L}_{10}/2)| < t\right) \le \frac{4t}{\sqrt{2\pi}\,\|C^{-1/2}m_{10}\|_{\mathbb{R}^p}}. \]



We prove this result at the end of the current proof. Let us note that it is 
equation (40) that plays a crucial role in the proof. 

In the case where Ci ^ C2, 

[ i^{X e and |sinh(£io/2)| < t) dtA2{Pi, Pq) < c{ci)e^+^^'^ , 
Jo 

and the choice e = (j^§^c{ci)^ leads to the desired result. In the case where 
Ci — C2, 

2e2 



/ iy{X e /if and | sinh(/:io/2)| <t)dt< 
Jo 



2^||C-i/2mio|| 



and the choice e = ^/2tt\\C ^^^mio ||rp ■^2M{Pi Pq) l^^ads to the desired result. 
Indeed, in the case where Ci = Co, classical calculation leads to 

l|C-l(f<i-M0)ll^p 



A2{Pi.Po) = J e^"'^^'^'dP 



□ 
Let us now prove Lemma 6.1.

Proof. Let us begin with point 2. It is sufficient to notice that if $P_{1|0}$ is a gaussian
measure with covariance $C$ and mean $s_{10}$, and if $X$ is a random variable drawn
from $P_{1|0}$, then

\[ e^{h_0(P_{1|0},x)} = e^{-\sigma^2/8} \quad\text{and, in distribution,}\quad \mathcal{L}_{10}(X) \sim \mathcal{N}(0,\sigma^2), \]

where $\sigma = \|C^{-1/2}(\mu_1 - \mu_0)\|_{\mathbb{R}^p}$. Hence, we get

\[ \nu\left(|\sinh(\mathcal{L}_{10}(X)/2)| < t\right) = P\left(|\mathcal{N}(0,\sigma^2)| < 2\,\mathrm{Argsinh}(t)\right) \le \frac{4\,\mathrm{Argsinh}(t)}{\sqrt{2\pi}\,\sigma} \le \frac{4t}{\sqrt{2\pi}\,\sigma}. \]
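
The elementary bound used for point 2 can be checked numerically. The Python sketch below assumes the reading $P(|\mathcal{N}(0,\sigma^2)| < 2\,\mathrm{Argsinh}(t)) \le 4t/(\sqrt{2\pi}\,\sigma)$, which follows from the fact that the gaussian density is bounded by $1/(\sqrt{2\pi}\,\sigma)$ and that $\mathrm{Argsinh}(t) \le t$; the function names are hypothetical.

```python
import math
from statistics import NormalDist

def small_ball_probability(t, sigma):
    """P(|N(0, sigma^2)| < 2 * asinh(t))."""
    u = 2 * math.asinh(t)
    return 2 * NormalDist(0.0, sigma).cdf(u) - 1

def lemma_bound(t, sigma):
    """Upper bound 4t / (sqrt(2*pi) * sigma)."""
    return 4 * t / (math.sqrt(2 * math.pi) * sigma)

# The bound is tight for small t and becomes loose for large t.
for t in (0.01, 0.1, 1.0):
    print(t, small_ball_probability(t, 1.0), lemma_bound(t, 1.0))
```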

Let us now prove point 1 of the Lemma. 

l|sinh(£io(x)/2)|<t ( dPoM2(Pl,Po). 

^ Pll\\L^^{X)l2\<t) 
A^iPi.P^) 

(from Cauchy-Schwartz and Argshijj) > y). 

Finally, we conclude from point 2 of Theorem 8.4, given in Section 8, whose
hypothesis is satisfied since:

ci <di(Pi,Po) 



<2^K{Po,Pi) 

(from Pinsker inequality (see [24])), 
< 2||/^io|Il(^p„) 

(from the Cauchy-Schwarz inequality).

□ 



7. A geometrical Analysis of LDA to solve Problem 1 
7.1. Introduction and first result

Let $X$ be a separable Banach space, endowed with its Borel $\sigma$-field and
a gaussian measure $\gamma$. Throughout the next section, we will associate to any
measurable $f$ the set

$V_f = \{x \in X : f(x) > 0\}. \quad (45)$

In this section $X = \mathbb{R}^p$. Recall that $\alpha$ (defined by (5)) is the angle, according
to the geometry of $L_2(\gamma_C)$, between $\hat F_{10}$ and $F_{10}$. This quantity will play a very
important role in the whole section. In order to shorten the notation, we will
drop the arguments of the learning error and write $\mathcal{R}$ in this section and those that follow.
Recall that

$F_{10} = C^{-1} m_{10}, \quad m_{10} = \mu_1 - \mu_0, \quad s_{10} = \frac{\mu_1 + \mu_0}{2},$

where $\mu_1$ (resp. $\mu_0$) and $C$ are the mean and (common) covariance of the
distribution $P_1 = \gamma_{C,\mu_1}$ (resp. $P_0 = \gamma_{C,\mu_0}$) of data from group 1 (resp. 0). With
the above defined notation (45), the optimal rule and the plug-in rule can be
rewritten with

$V = V_{\langle F_{10},\, x - s_{10}\rangle_{\mathbb{R}^p}}$ and $\hat V = V_{\langle \hat F_{10},\, x - \hat s_{10}\rangle_{\mathbb{R}^p}}$.

For the purpose of this section, let us note that the learning error studied in 
the preceding section and introduced by equation (8) is (in the case of LDA) 

T^=\ (7c,Mo [xeV\v) +7c,p, {xeV\V 

which implies 

n = \ (7C...0 ^ \ ^ - ^) ) + {xe{v\v + ^ 

(46) 

The problem now becomes that of measuring two areas of $\mathbb{R}^p$ with $\gamma_{C,s_{10}}$.
Standard properties of gaussian measures now lead to

^ = -jlp (^(^(.,Gp>j;p \ V(.^Gp+ep>sP-Hdo) - (47) 

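where $d_0 = \langle \hat F_{10}, s_{10} - \hat s_{10}\rangle_{\mathbb{R}^p}$,

$G_p = C^{1/2}F_{10} = C^{-1/2}m_{10}, \quad \hat G_p = C^{1/2}\hat F_{10} \quad\text{and}\quad e_p = C^{1/2}(\hat F_{10} - F_{10}). \quad (48)$

One may note that the change of geometry implies

$\|G_p\|_{\mathbb{R}^p} = \|F_{10}\|_{L_2(\gamma_C)}, \quad \|\hat G_p\|_{\mathbb{R}^p} = \|\hat F_{10}\|_{L_2(\gamma_C)}, \quad \|e_p\|_{\mathbb{R}^p} = \|\hat F_{10} - F_{10}\|_{L_2(\gamma_C)}, \quad (49)$

and $\alpha$ (defined by equation (5)) is the angle, in the geometry of $\mathbb{R}^p$, between $G_p$
and $\hat G_p$.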
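The change of geometry in (48) and (49) can be illustrated numerically. The following is only a sketch under simplifying assumptions that are not in the text: the covariance is taken diagonal, $C = \mathrm{diag}(\texttt{var})$, and the plug-in direction is taken to be $\hat F_{10} = C^{-1}\hat m_{10}$ with the true covariance (the function name `lda_geometry` and its arguments are hypothetical). It returns $G_p$, $\hat G_p$, the angle $\alpha$ between them and the offset $d_0$.

```python
import math


def dot(u, v):
    """Euclidean inner product of two vectors given as lists."""
    return sum(a * b for a, b in zip(u, v))


def lda_geometry(mu1, mu0, mu1_hat, mu0_hat, var):
    """Return (G, G_hat, alpha, d0) for a diagonal covariance C = diag(var).

    Assumption (for illustration only): the plug-in rule uses the true
    covariance and estimated means mu1_hat, mu0_hat.
    """
    m10 = [a - b for a, b in zip(mu1, mu0)]
    m10_hat = [a - b for a, b in zip(mu1_hat, mu0_hat)]
    s10 = [(a + b) / 2 for a, b in zip(mu1, mu0)]
    s10_hat = [(a + b) / 2 for a, b in zip(mu1_hat, mu0_hat)]
    F = [m / v for m, v in zip(m10, var)]              # F_10 = C^{-1} m_10
    F_hat = [m / v for m, v in zip(m10_hat, var)]      # hat F_10 (assumed form)
    G = [math.sqrt(v) * f for v, f in zip(var, F)]     # G_p = C^{1/2} F_10
    G_hat = [math.sqrt(v) * f for v, f in zip(var, F_hat)]
    d0 = dot(F_hat, [a - b for a, b in zip(s10, s10_hat)])
    cos_alpha = dot(G, G_hat) / math.sqrt(dot(G, G) * dot(G_hat, G_hat))
    alpha = math.acos(max(-1.0, min(1.0, cos_alpha)))  # clamp rounding errors
    return G, G_hat, alpha, d0
```

With diagonal $C$, the squared $L_2(\gamma_C)$ norm of the linear form $x \mapsto \langle F_{10}, x\rangle$ is $\sum_i \texttt{var}_i F_{10,i}^2$, which equals $\|G_p\|^2_{\mathbb{R}^p}$, in line with (49).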

The following theorem gives lower bounds and upper bounds on the learning
error $\mathcal{R}$ as functions of (among others) $\alpha$. Its proof relies on the fact that $\mathcal{R}$ is
the measure by $\gamma_2$ of two "simple" areas of $\mathbb{R}^2$ (see Figure 5) and the use of four
elementary properties of gaussian measure to be given later (see Figure 6).

Theorem 7.1. Let $d_0 = \langle \hat F_{10}, s_{10} - \hat s_{10}\rangle_{\mathbb{R}^p}$. The learning error $\mathcal{R}$ as a function
of $\alpha$ satisfies:

$\forall \alpha \in [-\pi,\pi], \quad \mathcal{R}(\alpha) = \mathcal{R}(-\alpha).$
The learning error also satisfies the following inequalities.

Ifa>l, thenU > \. 

If < a < ^ , then we have TZ < and we distinguish between four cases. 
1- If\do\ < i|(Fio,Fio>L2(7c)b 



11^-10 II 



^2(tc) I j a 1 
' 4 2^+2^^ 



0;- 



I do I tan(a) 



InF-ii^lolU: 



he) 



and 
7^ < 



0; (1 + tan(a)) 



< 7^, 



I do I tan(a) 



(50) 



l|nf^ii^lo||L2(7c) 



ll^-ioll 



1 /I 



4 V2 



T7i 



Or 



10||L2(7c) 



27r y 



< 7e 



0; (1 + tan(a)) 



|o?o| tan(a) 

np-|,-P'lo||L2(7c) 



3. //jK-Fio, Ao)l2(7c)I < Mol' '"'^ 



a 1 
47r 4 



Mo| =0, t/ien we have 



71 



0; 



ll^ioll i2(7c) 



<7^, 



(51) 

(52) 
(53) 

(54) 



0; (1 + tan(a))- 



|do| tan(a) 



IIFioll 



^2(tc) a „ 

^ — — < 7^. 

2n - 



(55) 



Proof. Step 1: The problem is two dimensional. We shall prove this equality:

$\mathcal{R} = \frac{1}{2}\gamma_2\left(Q^{\alpha}_{+} - y^{+}\right) + \frac{1}{2}\gamma_2\left(Q^{\alpha}_{-} - y^{-}\right), \quad (56)$

where $Q^{\alpha}_{+}$, $Q^{\alpha}_{-}$, $y^{+}$ and $y^{-}$ will be defined below. $Q^{\alpha}_{+}$ and $Q^{\alpha}_{-}$ are two areas,
$y^{+}$ and $y^{-}$ are two vectors of $\mathbb{R}^2$ and all these quantities are illustrated in Figure
5. In the following we shall use the notation $\tilde e_p = \Pi_{G_p^{\perp}} e_p$ for the orthogonal
projection of $e_p$ on the orthogonal of $G_p$ in $\mathbb{R}^p$. We will suppose that $\|\tilde e_p\|_{\mathbb{R}^p} \neq 0$,
since the part of the result concerning $\|\tilde e_p\|_{\mathbb{R}^p} = 0$ is straightforward. The
calculation of $\mathcal{R}$ is intrinsically a calculus in the two dimensional space $M_p$,
spanned by $G_p$ and $e_p$. In order to make this fact clear, note that for all $z_1 \in M_p$,
$z_2 \in M_p^{\perp}$ we have:
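$\left(V_{\langle\cdot,\,G_p+e_p\rangle_{\mathbb{R}^p}+d_0} \setminus V_{\langle\cdot,\,G_p\rangle_{\mathbb{R}^p}}\right) + z_1 + z_2 = \left(V_{\langle\cdot,\,G_p+e_p\rangle_{\mathbb{R}^p}+d_0} \setminus V_{\langle\cdot,\,G_p\rangle_{\mathbb{R}^p}}\right) + z_1$

and

$\left(V_{\langle\cdot,\,G_p\rangle_{\mathbb{R}^p}} \setminus V_{\langle\cdot,\,G_p+e_p\rangle_{\mathbb{R}^p}+d_0}\right) + z_1 + z_2 = \left(V_{\langle\cdot,\,G_p\rangle_{\mathbb{R}^p}} \setminus V_{\langle\cdot,\,G_p+e_p\rangle_{\mathbb{R}^p}+d_0}\right) + z_1$

(here $M_p^{\perp}$ is the orthogonal of $M_p$ in $\mathbb{R}^p$). By the tensorial property of $\gamma_p$ and
equation (47), we finally get

$\mathcal{R} = \frac{1}{2}\gamma_2\left(M_p \cap \left(V_{\langle\cdot,\,G_p+e_p\rangle_{\mathbb{R}^p}+d_0} \setminus V_{\langle\cdot,\,G_p\rangle_{\mathbb{R}^p}}\right) - \frac{G_p}{2}\right) \quad (57)$

$\qquad + \frac{1}{2}\gamma_2\left(M_p \cap \left(V_{\langle\cdot,\,G_p\rangle_{\mathbb{R}^p}} \setminus V_{\langle\cdot,\,G_p+e_p\rangle_{\mathbb{R}^p}+d_0}\right) + \frac{G_p}{2}\right). \quad (58)$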



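Also, in the sequel we will identify $M_p$ with $\mathbb{R}^2$; $D$ and $\hat D$ will be the straight
lines of $M_p$ with equations $\langle\cdot, G_p\rangle_{\mathbb{R}^p} = 0$ and $\langle\cdot, G_p + e_p\rangle_{\mathbb{R}^p} + d_0 = 0$. It can easily
be shown that these lines intersect in $a_p$ given by

$a_p = -d_0\,\frac{\tilde e_p}{\|\tilde e_p\|^2_{\mathbb{R}^p}}. \quad (59)$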
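The intersection point of the two frontier lines can be checked directly. The sketch below assumes the point has the form $a = -d_0\,\tilde e/\|\tilde e\|^2$, where $\tilde e$ is the component of $e$ orthogonal to $G$; since $\langle \tilde e, G\rangle = 0$ and $\langle \tilde e, e\rangle = \|\tilde e\|^2$, such a point satisfies both line equations. The function name `intersection_point` is hypothetical and vectors are plain Python lists.

```python
def dot(u, v):
    """Euclidean inner product of two vectors given as lists."""
    return sum(a * b for a, b in zip(u, v))


def intersection_point(G, e, d0):
    """Point a with <a, G> = 0 and <a, G + e> + d0 = 0.

    Assumes e is not collinear with G, so that the projection e_tilde of e
    on the orthogonal of G is nonzero.
    """
    coef = dot(e, G) / dot(G, G)
    e_tilde = [ei - coef * gi for ei, gi in zip(e, G)]  # orthogonal projection
    n2 = dot(e_tilde, e_tilde)
    if n2 == 0.0:
        raise ValueError("e is collinear with G: the two lines are parallel")
    return [-d0 * c / n2 for c in e_tilde]
```

For instance, with `G = [0, 2]`, `e = [1, 1]` and `d0 = 3`, the projection is `e_tilde = [1, 0]` and the returned point lies on both lines.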



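Also,

$V_{\langle\cdot,\,G_p\rangle_{\mathbb{R}^p}} = V_{\langle\cdot - a_p,\,G_p\rangle_{\mathbb{R}^p}}, \qquad V_{\langle\cdot,\,G_p+e_p\rangle_{\mathbb{R}^p}+d_0} = V_{\langle\cdot - a_p,\,G_p+e_p\rangle_{\mathbb{R}^p}},$

and with the same calculus that was used to obtain (47), equation (57) becomes:

$\mathcal{R} = \frac{1}{2}\gamma_2\left(M_p \cap \left(V_{\langle\cdot,\,G_p+e_p\rangle_{\mathbb{R}^p}} \setminus V_{\langle\cdot,\,G_p\rangle_{\mathbb{R}^p}}\right) - \frac{G_p}{2} + a_p\right) \quad (60)$

$\qquad + \frac{1}{2}\gamma_2\left(M_p \cap \left(V_{\langle\cdot,\,G_p\rangle_{\mathbb{R}^p}} \setminus V_{\langle\cdot,\,G_p+e_p\rangle_{\mathbb{R}^p}}\right) + \frac{G_p}{2} + a_p\right). \quad (61)$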



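Notice that for reasons of symmetry we can assume that $d_0 > 0$ without loss of
generality. In the sequel, we shall use the notation

$y^{+} = \frac{G_p}{2} - a_p, \qquad y^{-} = -\frac{G_p}{2} - a_p; \quad (62)$

the coordinates of $y^{+}$ in the orthonormal coordinate system obtained from the
orthogonal coordinate system $(0, \tilde e_p, G_p)$ will be noted $(y_h, y_v)$ and are equal to
$\left(\frac{d_0}{\|\tilde e_p\|_{\mathbb{R}^p}}, \frac{\|G_p\|_{\mathbb{R}^p}}{2}\right)$. We also note

$Q^{\alpha}_{+} = M_p \cap \left(V_{\langle\cdot,\,G_p+e_p\rangle_{\mathbb{R}^p}} \setminus V_{\langle\cdot,\,G_p\rangle_{\mathbb{R}^p}}\right)$ and $Q^{\alpha}_{-} = M_p \cap \left(V_{\langle\cdot,\,G_p\rangle_{\mathbb{R}^p}} \setminus V_{\langle\cdot,\,G_p+e_p\rangle_{\mathbb{R}^p}}\right).$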

(63)

Figure 5. Figure giving the definition of $Q_{-}$, $Q_{+}$ and $Q_{\epsilon}$ for Lemma 7.1.

We finally derive equation (56). From Figure 5, we notice that replacing $\alpha$ by
$-\alpha$, $\mathcal{R}$ does not change; that if $0 < \alpha < \pi/2$ then $\mathcal{R} < 1/2$ and if $\pi > \alpha > \pi/2$
then $\mathcal{R} > 1/2$. Also, we will now suppose that $\alpha \in [0, \pi/2]$.

Step 2. The rest of the proof relies on the following lemma.

Lemma 7.1. Let $Q_{-}$, $Q_{+}$ and $Q_{\epsilon}$ be defined by Figure 5, forming, with $Q^{\alpha}_{+}$ and $Q^{\alpha}_{-}$,
a partition of $\mathbb{R}^2$. Let $u = \tan(\alpha)\,y_h$. We then have

• If (z Q-, then 



^7i([0;|yJ]) + £ + 7i([0,f])7i 



0; 



yv/2 



cos a 



sin(a) 



<72(Q'_-y-) 



72(Ql - y-) < ^ +7i([0; \u\il + tan(a))]), 



(64) 



' If y G Q+, then 



e-'^U^7i([0; Ml) + £ ) < 72(g'_ - 
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• If y~ £ Qt, then 



(7i([0;((l+tan(a))|zi|]) 



2ti 



38 



(65) 



l2{Qt - y-) < (7i([0; (1 + tan(a))|^.|]) + ^) . 

• We have concerning 72 (Q" — J/^)-' 

72(Q--2/+)<72(g--2/"). 

• Finally, if yh — 0, we have 

e-'^^ < 72(Q° - 2/+) = 72(0'- - y-). 



(66) 
(67) 
(68) 



This Lemma will be proven in Subsection 7.3; let us see how it implies Theorem 7.1.
Fix $\epsilon = 1$ for the rest of the proof (other values of $\epsilon$ will help us in
the proof of Theorem 2.2). Equation (67) of the lemma implies that



1 



l2iQt-y-)<n<-f2{Q'L-y- 



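Recall that $(y_h, y_v)$ has been defined following equation (62) as the coordinates
of $y^{+}$ and that $u = \tan(\alpha)\,y_h$. A simple calculation leads to

$u = \frac{|d_0|\tan(\alpha)}{\|\Pi_{F_{10}^{\perp}}\hat F_{10}\|_{L_2(\gamma_C)}}$ and $y_v = \frac{\|F_{10}\|_{L_2(\gamma_C)}}{2}.$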

If ^|(Gp, G'p)rp| < Idol, we have in the preceding Lemma y_ G and: 



Zn 



tan(Q!)||i^io||L2(7c) 



0; (1 + tan(a))- 



47r 



|(io| tan(Q:) 



nFi^^io||L2(7c) 



The case where |do| < \\{Gp,Gp)Kp\ (which means that 2|m| < \yy\) is the case 
where y- G Q+, and we then have: 



Il-Fiol 



0; 



I do I tan(Q!) 

\^F^„Fw\\L2{'yc) 



and 



7^ < e~ 



(C > 



0; (1 + tan(a)) 



Idol tan(a) 
|nFj^^io||L2(7c). 
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If i|(Gp,Gp)Hp| < Idol < \\{Gp,Gp)s,p\, (which means that 2|u| > \y^,\ > \u\) 
we have in the preceding lemma ?/_ £ Qe (e = 1), and since in this case |?/t,| > 
\u\ > l2/t-l/2, we get: 



and 



4 V2 



:7i 



0; 



10|1L2(7c) 



0; (1 +tan(a)) 



This ends the proof of Theorem 7.1. 



27r 



< n 



\do \ tan(a) 
|np-i,^io||L2(7c) 



□ 



7.2. Proof of Theorem 2.2 



Theorem 2.2 is also a consequence of the preceding Lemma. We will use the
preceding lemma while tuning the value of $\epsilon$. We use without restating them
the definitions given before the preceding lemma.

Let us assume that $\frac{2|d_0|}{|\langle F_{10}, \hat F_{10}\rangle_{L_2(\gamma_C)}|}$ has an inferior limit $a < 1$. Then, there
exists $\epsilon > 0$ such that $y^{+}$ and $y^{-}$ (defined by (62)) belong to $Q_{+}$ (for $\|F_{10}\|_{L_2}\cos(\alpha)$
large enough), and equation (65) implies that



ll^-ioll 



7^ < e" 



N 
2tt 



and TZ tends to when ||-Fio|li, cos^(a) tends to infinity. 



If now $\frac{2|d_0|}{|\langle F_{10}, \hat F_{10}\rangle_{L_2(\gamma_C)}|}$ tends to $a > 1$, then $y^{+}$ or $y^{-}$ (given by (62)) belongs
to $Q_{-}$ (for $\|F_{10}\|_{L_2}\cos(\alpha)$ large enough). In this case equation (64)
leads to

n>-(\ji{[0;\\Fio\\Lj2]) (69) 



+ 71 



0; 



4 V2 

|flo||L2 COs(q;) 

4sin(a) 



7i([0;||Fio||l2/4]) + 



27r 



we obtain the desired result by letting $\|F_{10}\|_{L_2}$ tend to infinity. One has to
observe that $\alpha$ depends on $\|F_{10}\|_{L_2}$ and that the limit values $\alpha = \pi/2$ and
$\alpha = 0$ require the use of different terms in inequality (69). This ends the proof
of Theorem 2.2.



7.3. Proof of Lemma 7.1 



This proof is the central part of this section. It is mostly geometrical, and requires
only the following four properties (illustrated by Figure 6):

Figure 6. The four properties used in the proof (panels: Property 1, Property 2, Property 3, Property 4).

• Property 1. If $A \subset \mathbb{R}^2$ is the area between the two half straight lines $(0, u)$ and $(0, v)$
such that $\mathrm{Angle}(u, v) = \alpha$, then $\gamma_2(A) = \frac{\alpha}{2\pi}$. This result follows directly
from rotational invariance of the gaussian measure. Such an area will be
called an angular portion of size $\alpha$ and centre $0$.

• Properties 2 and 3. Let $y \in \mathbb{R}^2$, $D$ a straight line of $\mathbb{R}^2$, $b$ the orthogonal
projection of $y$ on $D$ and $h$ the distance from $y$ to $D$. If $A \subset \mathbb{R}^2$ and
$A$ is included in the half plane delimited by $D$ that does not contain $y$,
then $\gamma_2(A - y) \le e^{-h^2/2}\gamma_2(A - b)$. This is property 2. If $A \subset \mathbb{R}^2$ is
included in the half plane delimited by $D$ that contains $y$ then $\gamma_2(A - y) \ge
e^{-h^2/2}\gamma_2(A - b)$. This is property 3.

• Property 4. If $A = [0; d] \times [0; \infty[$ (see Figure 6) then $\gamma_2(A) = \frac{1}{2}\gamma_1([0; d])$.
Such a rectangle will be called an infinite rectangle of origin $0$ and height
$d$.
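Properties 1 and 4 are easy to check by simulation. The following is a minimal sketch, not part of the proof: it estimates $\gamma_2$ of an angular portion and of an infinite rectangle by Monte Carlo and compares with $\alpha/(2\pi)$ and $\frac{1}{2}\gamma_1([0;d])$. All function names, the sample size and the seed are arbitrary choices made for this illustration.

```python
import math
import random


def gamma1_interval(d):
    """gamma_1([0; d]) for the standard gaussian measure on R."""
    return 0.5 * math.erf(d / math.sqrt(2.0))


def gamma2_monte_carlo(indicator, n=100_000, seed=0):
    """Monte Carlo estimate of gamma_2(A) for A given by its indicator."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n):
        x = rng.gauss(0.0, 1.0)
        y = rng.gauss(0.0, 1.0)
        if indicator(x, y):
            hits += 1
    return hits / n


def angular_portion(alpha):
    """Indicator of the sector between the half lines of angle 0 and alpha."""
    return lambda x, y: 0.0 <= math.atan2(y, x) <= alpha


def infinite_rectangle(d):
    """Indicator of [0; d] x [0; infinity[."""
    return lambda x, y: 0.0 <= x <= d and y >= 0.0
```

For example, `gamma2_monte_carlo(angular_portion(math.pi / 3))` should be close to $1/6$, and `gamma2_monte_carlo(infinite_rectangle(1.0))` close to `0.5 * gamma1_interval(1.0)`, up to Monte Carlo error.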

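We will note $q$ and $\hat q$ the orthogonal projections of $y$ on $D$ and $\hat D$. The properties
2 and 3 are well known but for the sake of completeness we recall their proof.
It suffices to note that

$\gamma_2(A - y) = \int_{x \in A} e^{-\|x - y\|^2_{\mathbb{R}^2}/2}\,\frac{dx}{2\pi} = e^{-h^2/2}\int_{x \in A} e^{\langle x - b,\, y - b\rangle_{\mathbb{R}^2}}\, e^{-\|x - b\|^2_{\mathbb{R}^2}/2}\,\frac{dx}{2\pi}$

and that $x \in A$ implies $\langle x - b, y - b\rangle_{\mathbb{R}^2} \le 0$ for property 2 and $\langle x - b, y - b\rangle_{\mathbb{R}^2} \ge 0$
for property 3.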

We are now going to distinguish between a number of cases and, in each of
them, use the announced properties. First note that the inequality concerning
$y^{+}$ is trivial. Figures 7 and 5 will be useful in the following.

Case e Q^. In this case \yy\ < \u\. One can include in Q'L the disjoint 
union of an infinite rectangle of origin y~ , and height \yv\ ; an angular portion 
of size a and centre y~ ; and a rectangle with vertex y^ height \yv\/'2 and length 
|y^/2 ™^|"j |. Using properties 4 and 1, we then get: 

^7i([0;k|]) + ^+7i([0,f])7i( 

On the other hand, Q^L can be included in the disjoint union of an angular 
portioin with centre y~ , of two infinite rectangles with height less than or equal 
to |u| tan(Q!) and of two infinite rectangle of height lower or equal to Also, 
properties 1 and 4 imply: 

l2{Qt - y-) < £ + 7i([0; \u\{l + tan(«))]). (71) 

Case y^ E Q+. In this case \yv\ > (1 + e)|it|, is at a distance \yi,\ from D 
and at a distance {\yv\ ~ |u|)cos(q;) > -^^\yv\cos{a) from D. Properties 2 and 
3 imply: 

e-^72(0'_ -q)< 72 (Q- - y") < e~' '-i^Tj'^^lQ'i - q). (72) 



yv/2 



cos(a) 
sin(a) 



<l2iQt-y-). (70) 
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One can include in an angular portion of size a and with centre q or an 
infinite rectangle of origin y and height Also, properties 1 and 4 imply, with 
(72) and the fact that max(a, b) > the equation: 

^(^7i([0;H]) + ^) <72(g'L-g). 

The set Q'L can be included in the union of an angular portion of size a centred 
in q and of two infinite rectangles of origin q and height |u|(l + tan(Q;)). Also, 
properties 1 and 4 together with (72) and max(a, b) > ^-j^ imply the following 
equation: 

e-'^^^7i([0;|^|]) + ^) <72(Q'- -?;-), (73) 

72(g'- - y-) < ^^+»)^° (7i([0; \u\{l + tan(a))]) + . 

Case E Q^. In this case (1 + e)|u| > \yv\ > \u\, y^ is at a distance 
\yv\ < (1 + e)|u| from D and at a distance {\yv\ — |u|)cos(q!) > from D. 
Properties 2 and 3 imply 

e-^^^^72(Q'_ 'q)< i2iQt -y-)< 12{QI - q). (74) 

from which we deduce the following inequality in the same way as in the pre- 
ceding paragraph: 

e-''^^^^7i([0;|«|]) + ^) <72(Q'_-y-), (75) 

72(Q'- - y~) < (7i([0; 1^1(1 + tan(a))]) + . 
This ends the proof of the Lemma. 

Remark 7.1 (On log-concave measures). It is natural to ask which type of
probability measure satisfies the four properties used. Concerning property 2, it
is possible to consider measures that are not gaussian. Suppose that $\mu$ is a
probability measure on $\mathbb{R}^p$ with positive density $a e^{-\phi}$ with respect to the Lebesgue
measure, where $\phi$ is strictly convex in the sense that there exists $c > 0$ such that
for all $x, y \in \mathbb{R}^p$

cbix)+cj^{y)-2cb(^^'^>^\\x^y\\l, (76) 

$\phi(0) = 0$, $0 = \mathrm{Arginf}\,\phi$, $a$ is a positive constant and $\phi$ is radial: there exists a
function $\psi$ from $\mathbb{R}$ to $\mathbb{R}$ such that $\phi(x) = \psi(\|x\|)$. Let $y \in \mathbb{R}^p$, $D$ be a hyperplane
of $\mathbb{R}^p$, $b$ the orthogonal projection of $y$ on $D$, $h$ the distance from $y$ to $D$ and
$A \subset \mathbb{R}^p$ included in the half space delimited by $D$ which does not contain $y$. One
can show (see proposition 3.3.1 pl26 in [L^>]) that 

li{A-y) < e-^'^/^(A-6). 
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7.4. Proof of Theorem 2.1 

Proof. The second equation of the Theorem results directly from equation (51)
in Theorem 7.1. To show the first equation of the Theorem, we will distinguish four cases.
Case number 4 is the important one and relies on the use of Theorem 7.1. The
other cases rely on verifying that the right member of the first equation of the
Theorem is not too small.

1. Case where (Fio, -F'io)l2(7c) < ^■ 

Let us note that because $\mathcal{R}$ is a probability, we have $\mathcal{R} \le 1$. In addition,

£ > \\Fio - i^io||L2(7c) > \\Pio\\l2{jc)- 
which implies that TZ^ < ire — ■ 

^ — ll-fiolUaCTc) 

2. Case where (Fio, -F'io)l2(7c) > ^ ^^'^^ I|Ao||l2(7c) ^ 5II-^io||l2(7c)- 
Recall that TZ is upper bounded by ^ when (Fio, -Fio)l2(7c) > ^ (^^^ 
Theorem 7.1, it is the case where a defined by (5) satisfies — 7r/2 < a < 
7r/2). 

In addition, the inequality ||Ao||l2(7c) - 5ll-^ioilL2(7c) implies 

£ > 2II^ioIIl2(7c)' 
and as a consequence TZp < ^ implies that TZp < np^^y^ ^ — -. 

3. Case where (Ao, ■F'io)l2(7c) > 0' I|AoIIl2(7c) ^ hW^whoJ-yc) ct f > a > 
J (recall that a has been defined by 5). 

Since ^ > a > j, we have cos(a) < ^ and as a consequence and with the 
help of (5): 

(^"10, -F'io)l2{7c) - — II-^1o||l2(7c)II-^1oIU2(7c)- 

Under this last constraint, we have 

min llFio - Flolli.f^^) = min ((1 - + a^) ||-Fio||i,(^^) = ll^io||i2(7c)' 

-Fio " 

which again implies TZp < 



P — Il-Fl0|ll,2(7c) ' 

4. Case where (Fio, ^10)^^(70) > 0, I1AoIIl2(7c) > ^ll-P'iojlL 2(70) ^'^'^ < f • 
Since a € [0, j], the concavity of the sin function gives 

a ^ sin(a) 

In addition, the relation ||-Fio||l2(7c) — ?II-^io||l2(7c) i™plics that 

HnF^-L/lo|U2(7c) ^ 2||j^io^Ao|U2(7c) 

sm(a) = < 

\\Fio\\L2hc) II-P^io||l2(7c) 
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(the first inequality is a trigonometric formula). Finally, we obtain: 

\\Fw - Fio||l2(7c) 



< 



^ V2||^io||l2(7c) 



(77) 



Recall that $d_0 = \langle \hat F_{10}, s_{10} - \hat s_{10}\rangle_{\mathbb{R}^p}$. The equality defining $\alpha$ (5) and the
fact that $\cos(\alpha) \ge \frac{1}{\sqrt{2}}$ now imply:



Idol tan(a) 

\^f^„Fw\\l2{-ic 



< V2\do\ 



sin(Q;) 



- (smce cos(a) > -— ) 

\^F^„Fw\\L2hc) ^ 



V2|do| 



II^1o||l2(7c) 



(from a trigonometric formula). 



Also, noticing that 7i([0;u]) < and that tan(a) < 1, we get: 



71 



0; (1 + tan(a))- 



|do| tan(a) 
nFi^Q^iolU2(7c). 



< 



71 



2V2|do 



|-P'io||l2(7c) 



< 



2|do| 



(78) 



V^I|-f'lo||L2(7c) 

In the cases 1, 2 and 3 of Theorem 7.1, because tan(a) < 1 (a < f), the 
equations (77), (78), (51), (54) imply: 



n < 



£ 



10||L2(7c) 



This ends the proof of Theorem 2.1. 



□ 



8. A general scheme to solve Problem 1 
8.1. Introduction and main result 

Presentation of the main ideas. In this section, we will prove results
concerning the QDA procedure. Recall that the learning error $\mathcal{R}$ (the probability
of misclassifying data with a given rule when the optimal rule gives a correct
classification) satisfies:



$\mathcal{R} \le \frac{1}{2}\left(P_1\left(X \in V_{\mathcal{L}^Q_{10}} \Delta V_{\hat{\mathcal{L}}^Q_{10}}\right) + P_0\left(X \in V_{\mathcal{L}^Q_{10}} \Delta V_{\hat{\mathcal{L}}^Q_{10}}\right)\right) \quad (79)$

(If $f : X \to \mathbb{R}$, $V_f$ is defined by (45) at the beginning of the preceding section).
Indeed, the event $X \in V_{\mathcal{L}^Q_{10}} \Delta V_{\hat{\mathcal{L}}^Q_{10}}$ corresponds to the case where decisions (good
or erroneous) taken by the optimal rule and the plug-in rule are different.

Remark 8.1. In the case of procedure LDA, we had

From this equation, one can easily deduce that 

27^ = \ {X e VAV - ^) + {X ^VAV+'-^)). 

and as a consequence: 

27^ = i (Pi (X G V^aAVc^^^ ) + P,iXe V^aAV^a^ )) . (80) 

It is less obvious that this type of relation holds in the quadratic case.

In subsection 8.2 we will present a technique to put an upper bound on
probabilities like $P(V_f \Delta V_{f+\delta})$. In this type of quantity, we shall call
perturbation function the measurable function $\delta$ (which can be thought of as a small
function) and optimal frontier function the measurable function $f$ from $X$ to $\mathbb{R}$.
In the case of the QDA, the results obtained are consequences of Theorem 8.1
given in the next paragraph, with frontier function $f = \mathcal{L}^Q_{10}$ and perturbation
function $\delta = \hat{\mathcal{L}}^Q_{10} - \mathcal{L}^Q_{10}$.



A general result concerning quadratic perturbation of a quadratic
rule. In the sequel we need to introduce some quantities related to gaussian
measure in separable Banach spaces, and $X$ is a separable Banach space. We
refer to [8] and its section on measurable polynomials for a rigorous treatment
of the subject. The Hilbert space of measurable affine functions from $X$ to $\mathbb{R}$ with
finite $L_2(\gamma_{C,m})$ norm and null integral with respect to $\gamma_{C,m}$ will be denoted by
$X^*_{\gamma_{C,m}}$. The Hilbert space of measurable quadratic forms in $L_2(\gamma_{C,m})$ with null
integral with respect to $\gamma_{C,m}$ will be denoted $E_2(\gamma_{C,m})$. The space of measurable
quadratic forms in $L_2(\gamma_{C,m})$ will be denoted by $X_{2,\gamma_{C,m}}$ and we have the classical
gaussian chaos decomposition in $L_2(\gamma_{C,m})$:

$X_{2,\gamma_{C,m}} = \{\mathrm{cte}\} \oplus X^*_{\gamma_{C,m}} \oplus E_2(\gamma_{C,m}).$

In infinite dimension $H(\gamma_{C,m})$ is the reproducing kernel Hilbert space associated
to $\gamma_{C,m}$; in finite dimension ($X = \mathbb{R}^p$), we have (if $C$ is of full rank)
$H(\gamma_{C,m}) = \mathbb{R}^p$. Recall that to each Hilbert-Schmidt operator $A$ on $H(\gamma_{C,m})$,
one can associate the measurable element $q_A^{\gamma_{C,m}}$ of $E_2(\gamma_{C,m})$ and that each element
of $E_2(\gamma_{C,m})$ is associated to a unique Hilbert-Schmidt operator on $H(\gamma_{C,m})$. In
finite dimension, if $C$ is of full rank:

$q_A^{\gamma_{C,m}}(x) = q_{C^{-1/2}AC^{-1/2}}(x - m) - \int_X q_{C^{-1/2}AC^{-1/2}}(x - m)\,\gamma_{C,m}(dx)$

(recall that $q_A(x) = \langle Ax, x\rangle_{\mathbb{R}^p}$)

$\qquad = \langle AC^{-1/2}(x - m), C^{-1/2}(x - m)\rangle_{\mathbb{R}^p} - \sum_{i=1}^{p}\lambda_i,$

where $(\lambda_i)_{i=1,\dots,p}$ is the vector of the eigenvalues of $A$.

Theorem 8.1. Let $X$ be a separable Banach space, $\gamma_{C,m}$ be a gaussian measure
on $X$ with mean $m$ and covariance $C$. Let $A$ and $D$ be 2 symmetric Hilbert-Schmidt
operators on $H(\gamma_{C,m})$, $F, d \in X^*_{\gamma_{C,m}}$, and let

$f(x) = c + F(x) + q_A^{\gamma_{C,m}}(x)$ and $\delta(x) = d_0 + d(x) + q_D^{\gamma_{C,m}}(x)$

be the functions defining $V_f$ and $V_{f+\delta}$ (if $g : X \to \mathbb{R}$, $V_g$ is defined by equation
(45)). Finally, let $r, R \in \mathbb{R}$ be such that $R > r > 0$.

1. Assume that r < ||/|jL9(7c m) ■ Then, for all q g]0, 1[, there exists Ci(r, q) > 
( that only depends on r and R ) such that 

^c,m{Vf^Vf+,) < ci{r,q)\\5\\f_^^^^^y (81) 

2. // |Ei2(^p ^)[/]| > r and |1/|1l2(7c m)' ihen, for all q €]0, 1[, there exists 
C2 {r, q) > ( that only depends on r and R ) such that 

^c,m{Vft^Vf+s) < ^2{r,q)\\5t;;g^^^y (82) 

The two following subsections are devoted to the proof of this theorem. Sub- 
section 8.2 presents a general methodology to obtain this type of result, and in 
Section 8.4, we apply this methodology to obtain Theorem 8.1. 

8.2. Decomposition of the domain 

We will give an upper bound to the probability that $X \in V_f \Delta V_{f+\delta}$. In the cases
we have in mind, this set is essentially composed of elements for which $\delta$ takes
large values or $f$ is near zero. Also, we shall bound the measure of areas on
which

1. the perturbation is large (with a large deviation inequality).

2. $|f|$ is small (with an inequality bounding $P(|f(X)| \le \epsilon)$ by a function of $\epsilon$).
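The decomposition rests on a pointwise fact: the decisions $\{f > 0\}$ and $\{f + \delta > 0\}$ can differ at $x$ only if $|f(x)| \le |\delta(x)|$. The following sketch illustrates this on a toy example. The frontier $f(x) = x_1$, the perturbation $\delta(x) = 0.05 + 0.1\,(x_2^2 - 1)$, the sample size and the seed are all hypothetical choices, not taken from the text.

```python
import random


def flip_and_small_frequencies(n=20_000, seed=0):
    """Empirical frequencies of {decisions differ} and {|f| <= |delta|}.

    Toy example with a standard gaussian X in R^2, f(x) = x1 and a small
    quadratic perturbation delta(x) = 0.05 + 0.1 * (x2^2 - 1).
    """
    rng = random.Random(seed)
    flips = 0
    small = 0
    for _ in range(n):
        x1 = rng.gauss(0.0, 1.0)
        x2 = rng.gauss(0.0, 1.0)
        f = x1
        delta = 0.05 + 0.1 * (x2 * x2 - 1.0)
        in_vf = f > 0                  # x belongs to V_f
        in_vf_delta = f + delta > 0    # x belongs to V_{f + delta}
        if in_vf != in_vf_delta:       # x in the symmetric difference
            flips += 1
        if abs(f) <= abs(delta):
            small += 1
    return flips / n, small / n
```

The first frequency never exceeds the second, whatever the seed, since every sample in the symmetric difference also satisfies $|f| \le |\delta|$.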

Lemma 8.1 that follows is based on the two following assumptions.

1. Assumption A1. There exist $c_0, c_1 > 0$ and $h_\delta : \mathbb{R}^+ \to \mathbb{R}^+$ non decreasing such
that $h_\delta(0) = 0$, $\lim_{s\to\infty} h_\delta(s) = \infty$ and

$\forall s > 0, \quad P(|\delta(X) - E[\delta(X)]| \ge c_0 h_\delta(s)) \le c_1 e^{-s^2/2}. \quad (83)$

2. Assumption A2. There exist $\beta > 0$ and $c_2 > 0$ such that

$\forall \epsilon > 0, \quad P(|f(X)| \le \epsilon) \le c_2\,\epsilon^{\beta}. \quad (84)$

Remark 8.2. The function $h_\delta$ of Assumption A1 will help us in measuring the
effect of a perturbation $\delta$.

Lemma 8.1. Under Assumption Ai (83) and A2 (84), for all q e]0; 1[ we have: 
P{X e VfAVf+s) <c\^''c2\Ep[S{xW 



where ^ is a centred real gaussian random variable with variance 1 . 
Proof. Recall that $V_f = \{x : f(x) > 0\}$.

P{X e VfAVf+s) = 

P {^{S{X) - E[S{X)]) - E[S{X)] < fix) < 
or < f{X) < {d{X) - E[S{X)]) + E[6{X)]) , 

also,

$P(X \in V_f \Delta V_{f+\delta}) \le P(U),$

where $U = \{|f(X)| \le |\delta(X) - E[\delta(X)]| + |E[\delta(X)]|\}$.

Define $B_j = \{c_0 h_\delta(j) \le |\delta(X) - E[\delta(X)]| < c_0 h_\delta(j + 1)\}$ for $j \in \mathbb{N}$. This family
of events covers all possible events.

We observe that

$P(U) = \sum_{j \ge 0} P(U \cap B_j),$

and then using the Hölder inequality ($p + q = 1$) we get:

$P(U) \le \sum_{j \ge 0} P(U \cap B_j)^{q}\, P(B_j)^{p}.$

It follows that 

PiX e VfAVf+s) 

< ^ P {\f{X)\ < \E[6{X)]\ + cohsU + 1))' P mX) - E[S{X)]\ > cohsij))'-" 

j 

< C2c}-'?^(|E[5(X)]| +co/j,-(j + l))^''e-^, 

j>0 

( from assumption Al and A2 ) 
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27r 



1 - 9 7o 

which implies the desired result 



{hs{x + 1) + |E[,5(X)]|)''' Ji_^e-^^^rfx 



27r 



□ 



Lemma 8.2. Let $\delta_1, \dots, \delta_k$ be $k$ perturbations satisfying assumption A1 defined
by equation (83) with the error functions $h_{\delta_1}, \dots, h_{\delta_k}$. Then, if $h_\delta = \sum_{i=1}^{k} h_{\delta_i}$,
there exist $c_0(k), c_1(k) > 0$ such that

$\forall s > 0, \quad P(|\delta - E(\delta)| \ge c_0 h_\delta(s)) \le c_1 e^{-s^2/2}. \quad (85)$



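Proof. Recall that for all $i$, $h_{\delta_i} \ge 0$. Let us fix $s > 0$. The proof relies on
the pigeonhole principle. Indeed, if $\sum_{i=1}^{k} |\delta_i - E[\delta_i]| \ge k \sum_{i=1}^{k} c_{0i} h_{\delta_i}(s)$ then
there exists $i_0 \in \{1, \dots, k\}$ such that $|\delta_{i_0} - E[\delta_{i_0}]| \ge \sum_{i=1}^{k} c_{0i} h_{\delta_i}(s)$. If we fix
$c_0 = k \max_i c_{0i}$, we then have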



i=l 



j=l / \i=l i=l ) 

( from the triangle inequality and the fact that 
fc fc 

Co ^ ^5, (s) > ^ C0i/i<5i (s) ) 

<p|^3ioe {!,..., fc} : |5,o-E[5,„]| >^co,/i5,(s)j 

(pigeon hole principle) 

fc 

(subadditivity of probability) 



< 



E 



(/15. satisfies assumption Ai), 



which ends the proof. 



□ 



The results that allow us to verify assumption A2 are presented in Section 8.5. 
We now recall some standard large deviation results that allow us to verify 
assumption Al. 

8.3. Large deviation 

In the case where $\delta$ is linear or Lipschitz, the following classical result (see for
example [8] (p174)) allows us to check assumption A1.

Theorem 8.2. Let $\gamma = \gamma_C$ be a gaussian measure of covariance $C$ on $X$ a
separable Banach space, $H = H(\gamma)$ be the associated reproducing kernel Hilbert
space, $\delta : X \to \mathbb{R}$ a function such that there exists $N(\delta) > 0$ with

$|\delta(x + h) - \delta(x)| \le N(\delta)\,|h|_{H(\gamma)} \quad \forall h \in H(\gamma), \quad \gamma\text{-a.s.} \quad (86)$

Then

$\forall s > 0, \quad \gamma\left(x \in X : \left|\delta(x) - \int \delta(x)\,d\gamma\right| > s\right) \le 2e^{-\frac{s^2}{2N(\delta)^2}}. \quad (87)$



(88) 



In the case where $\delta$ is quadratic, the following result from Laurent and Massart
[19] (Lemma 1 p1325) will help us to check assumption A1.

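Theorem 8.3. If $D = \mathrm{Diag}(d_1, \dots, d_p)$ and $q_D(x) = \langle Dx, x\rangle_{\mathbb{R}^p}$, then

$\gamma_p\left(x \in \mathbb{R}^p : q_D(x) - \int q_D(x)\,\gamma_p(dx) \ge s\|q_D\|_{L_2(\gamma_p)} + \sup_i |d_i|\, s^2\right) \le e^{-s^2/2},$

$\gamma_p\left(x \in \mathbb{R}^p : q_D(x) - \int q_D(x)\,\gamma_p(dx) \le -s\|q_D\|_{L_2(\gamma_p)}\right) \le e^{-s^2/2}.$

As a consequence, assumption A1 is satisfied with $h_\delta(s) = s\|q_D\|_{L_2(\gamma_p)} +
s^2 \sup_i |d_i| \le \|q_D\|_{L_2(\gamma_p)}(s + s^2)$.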
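The deviation bound can be checked by simulation. The sketch below uses the usual statement of the Laurent and Massart lemma, which is an assumption about the cited result rather than a quotation of this text: for nonnegative weights $a_i$ and $Z = \sum_i a_i(\xi_i^2 - 1)$, one has $P(Z \ge 2\|a\|_2\sqrt{x} + 2\|a\|_\infty x) \le e^{-x}$ and $P(Z \le -2\|a\|_2\sqrt{x}) \le e^{-x}$. Taking $x = s^2/2$ gives thresholds of the form $s\sqrt{2}\|a\|_2 + s^2\|a\|_\infty$. The function name, sample size and seed are arbitrary.

```python
import math
import random


def deviation_frequencies(a, x, n=50_000, seed=0):
    """Empirical upper and lower deviation frequencies for Z = sum a_i (xi_i^2 - 1).

    a: list of nonnegative weights; x: positive deviation level.
    Returns (freq of Z >= 2|a|_2 sqrt(x) + 2|a|_inf x, freq of Z <= -2|a|_2 sqrt(x)).
    """
    rng = random.Random(seed)
    l2 = math.sqrt(sum(ai * ai for ai in a))
    linf = max(a)
    upper_threshold = 2.0 * l2 * math.sqrt(x) + 2.0 * linf * x
    lower_threshold = -2.0 * l2 * math.sqrt(x)
    upper = 0
    lower = 0
    for _ in range(n):
        z = sum(ai * (rng.gauss(0.0, 1.0) ** 2 - 1.0) for ai in a)
        if z >= upper_threshold:
            upper += 1
        if z <= lower_threshold:
            lower += 1
    return upper / n, lower / n
```

Both frequencies should stay below $e^{-x}$; for instance with `a = [1.0, 0.5, 0.25]` and `x = 1.0` the lower event is in fact impossible, since $Z \ge -\sum_i a_i = -1.75$.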

The use we will make of these results is entirely contained in the following
corollary.

Corollary 8.1. Let $X$ be a separable Banach space, $\gamma$ a gaussian measure on $X$
and $\delta \in X_{2,\gamma}$. Then $\delta$ satisfies assumption A1 with $h_\delta(s) = \|\delta - E_\gamma[\delta]\|_{L_2(\gamma)}(s + s^2)$.

Proof. It suffices to check the result for $X = \mathbb{R}^p$ and to use a standard
approximation argument. Recall that in $L_2(\gamma)$, we have $X_{2,\gamma} = \{\mathrm{cte}\} \oplus X^*_\gamma \oplus E_2(\gamma)$.
Also, there exists a unique triplet $\delta_0 = E_\gamma[\delta] \in \{\mathrm{cte}\}$, $\delta_1 \in X^*_\gamma$ and $\delta_2 \in E_2(\gamma)$
such that $\delta = \delta_0 + \delta_1 + \delta_2$. From Theorem 8.3, assumption A1 is satisfied
for perturbation $\delta_2$, measure $P = \gamma$ and $h_{\delta_2}(s) = \|\delta_2\|_{L_2(\gamma)}(s + s^2)$. Because
$\delta_1 \in X^*_\gamma$, $\delta_1$ is affine. Also, by Theorem 8.2, the assumption A1 is satisfied for
perturbation $\delta_1$ with $h_{\delta_1}(s) = s\|\delta_1\|_{L_2(\gamma)}$. We can then conclude using Lemma
8.2 and the fact that

$\|\delta_2\|_{L_2(\gamma)}(s + s^2) + s\|\delta_1\|_{L_2(\gamma)} \le \left(\|\delta_1\|_{L_2(\gamma)} + \|\delta_2\|_{L_2(\gamma)}\right)(s + s^2)$

$\qquad \le \sqrt{2}\,(s + s^2)\,\|\delta - \delta_0\|_{L_2(\gamma)}.$

□

We now have all elements to demonstrate Theorem 8.1.



8.4. Proof of Theorem 8.1 

As announced, we shall apply Lemma 8.1. From Theorem 8.4, Assumption A2
is satisfied with $\beta = 1/3$ in the case 1 of our Theorem and with $\beta = 2/7$ in the
case 2 of our Theorem. In both cases the constant $c_2$ depends on $r$ only. In both
cases, from the preceding corollary, assumption A1 is satisfied with the function
$h_\delta(s) = (s + s^2)\|\delta - \delta_0\|_{L_2(\gamma)}$. Also, if we apply Lemma 8.1, for all $q \in ]0, 1[$,
there exists a constant $C(r, q) > 0$ such that

$\gamma(V_f \Delta V_{f+\delta}) \le C(r, q)\left(|E_\gamma(\delta)| + \|\delta - E[\delta]\|_{L_2(\gamma)}\right)^{\beta q},$

and a constant $C'(r, q) > 0$ such that

$\gamma(V_f \Delta V_{f+\delta}) \le C'(r, q)\,\|\delta\|^{\beta q}_{L_2(\gamma)}.$

This ends the proof of the Theorem.



8.5. Small crown probability 

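In this subsection $X_2$ is the set of real random variables that can be written
$c + \sum_{i \ge 0} \beta_i(\xi_i^2 - 1) + \alpha_i\xi_i$ with $c \in \mathbb{R}$, $\beta = (\beta_i)_i \in l_2(\mathbb{N})$, $\alpha = (\alpha_i)_i \in l_2(\mathbb{N})$, where
$(\xi_i)_{i \in \mathbb{N}}$ is a sequence of independent identically distributed gaussian random
variables with mean $0$ and variance $1$. Let $q \in X_2$ be given by

$q = c + \sum_{j \ge 0} \beta_j(\xi_j^2 - 1) + \alpha_j\xi_j; \quad (89)$

we will note

$n_1(q) = \max_i |\alpha_i|, \quad n_2(q) = \max_j |\beta_j|, \quad \sigma(q) = \left(\sum_{j \ge 0} 2\beta_j^2 + \alpha_j^2\right)^{1/2}. \quad (90)$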

V.>o J 

Theorem 8.4. 1. There exists $C(c_0) > 0$ such that

$\sup\{P(|q| \le \epsilon) : q \in X_2,\ |E[q]| \ge c_0\} \le C(c_0)\,\epsilon^{2/7}.$

2. There exists $C'(c_0) > 0$ such that

$\sup\{P(|q| \le \epsilon) : q \in X_2,\ E[q^2] \ge c_0\} \le C'(c_0)\,\epsilon^{1/3}.$

3. Let q G X2, for all e > 0, 



1 



TT n2{q) 



Remark 8.3. This result may seem surprising, and we did not show it is optimal.
If $n_2(q) = \max_i |\beta_i| \ge c_0$, the bound of point 3 is optimal in the sense that
if $\beta = (1, 0, \dots)$, $c = 1$ and $\alpha = 0$ we get $P(|q| \le \epsilon) = P(|\xi^2| \le \epsilon) \sim C\epsilon^{1/2}$ (for
a constant $C$ which can be calculated explicitly). In addition, when $\|\beta\|_{l_2} \to 0$
the behaviour of $P(|q| \le \epsilon)$ tends to be the same as $P(|\,\|\alpha\|_{l_2}\mathcal{N}(0, 1) - c| \le \epsilon) \sim
C''(c_0)\,\epsilon$. Also, it may be conjectured that points 1 and 2 of the Theorem can be
improved (in order to obtain an exponent 1/2 instead of 2/7 and 1/3) but we
believe this is unlikely. The difficult cases to study (and point 3 of the following
proof demonstrates this) are those with $\|\beta\|_\infty \to 0$ but $\|\beta\|_{l_2}$ does not tend to
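The explicit constant mentioned for the example $q = \xi^2$ can be worked out with the error function. The following is a small sketch, with hypothetical function names: since $P(\xi^2 \le \epsilon) = \mathrm{erf}(\sqrt{\epsilon/2})$ and $\mathrm{erf}(x) \approx 2x/\sqrt{\pi}$ for small $x$, the ratio $P(\xi^2 \le \epsilon)/\sqrt{\epsilon}$ tends to $\sqrt{2/\pi}$ as $\epsilon \to 0$.

```python
import math


def small_ball_probability(eps):
    """P(xi^2 <= eps) for a standard gaussian random variable xi."""
    return math.erf(math.sqrt(eps / 2.0))


def ratio_to_sqrt(eps):
    """Ratio P(xi^2 <= eps) / eps^(1/2), which tends to sqrt(2/pi) as eps -> 0."""
    return small_ball_probability(eps) / math.sqrt(eps)
```

Evaluating `ratio_to_sqrt` for decreasing values of `eps` shows the ratio increasing towards `math.sqrt(2 / math.pi)`, which is the exponent $1/2$ behaviour described in the remark.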
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Proof. Wc shall proceed in four steps. 
Step 1. We claim that if $|E[q]| > \epsilon$ then

P{\<1\<^)< 



(91) 



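Notice that $|q - E[q]| \ge \big||q| - |E[q]|\big|$ and if $|q| \le \epsilon < |E[q]|$ then $\big||q| - |E[q]|\big| =
|E[q]| - |q|$ and

$|q| \ge |E[q]| - |q - E[q]|.$

Also

$P(|q| \le \epsilon) \le P(|E[q]| - |q - E[q]| \le \epsilon) = P\left(1 \le \frac{|q - E[q]|}{|E[q]| - \epsilon}\right),$

which implies (91) by the Markov inequality.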

Step 2. We will assume without loss of generality that for all $i \in \mathbb{N}$, $\alpha_i \ge 0$.
In the following, $\alpha_{i_0} = \max_i \alpha_i$, $j_0 \in \arg\max_j |\beta_j|$ and
$\mathrm{sign}(x)$ is the function that returns the sign of the real $x$. We claim that



^(kl < e) < 



1 e 



nn2{q)' 



(92) 



Let 



To obtain the desired inequality, note that for all aj„ > 0, /3jg ^ 
P {\Z + + Pjaie -l)\<e)=P (I sign(/3,jZ + a,,^ + \l3,„\{e - l)\ < e) 



= P 



sign(/3jjZ 



+ + 



■'JO \2 



a," 



r-1- 



4/3: 



< 



30 



1/3., 



where 



f /n _L sign(/3)Z-£ 

-W(l + ^ ^ )+, 



and (x)+ = xlxX). The inequality (92) results from the choice a = aj^ and 
and from the fact that if u e 1 



^2(9) ■ 



Step 3. We claim that



^(kl < e) < 208 



2e 



?^2(g) ^ ^ 

^(g) ^ cr(g)' 



(93) 



We prove the following lemma (which is a central limit theorem) at the end of
the proof.

Lemma 8.3. Let $X_i = \beta_i(\xi_i^2 - 1) + \alpha_i\xi_i$, $\zeta$ be a gaussian centered random
variable with variance 1 and $\sigma(q)$ given by (90). We obtain:



sup 

e>0 



P||E,M+^X,|<.|-P(|^+^|< ' 



a{q) cr((?) 



max(|/3i|) 
< 104 



Also, because \&[q] \ > e 



o-(g) <j{q)J criq) 

we have inequality (93). 

Step 4. As announced we will distinguish several disjoint cases to demonstrate
points 1 and 2 of the theorem. We begin with point 1.

1. In the case where a{q) < e^^^, it is the inequality from step 1 (91) that 
leads to the desired conclusion. 

2. In the case where n2{q) > e^^"^, it is the inequality from step 2 (92) that 
leads to the desired conclusion. 

3. In the case where n2{q) < C'l'^ and aici) > e^/'', it is the inequality from 
step 3 (93) that leads to the desired conclusion. 

We conclude with point 2. 

1. In the case where n2{q) > e^/'^, it is the inequality from step 2 (92) that 
leads to the desired conclusion. 

2. In the case where n2{q) < e^^^ it is the inequality from step 3 (93) that 
leads to the desired conclusion. 

□ 

We now give the proof of Lemma 8.3.
Proof. This proof is decomposed into two steps. In the first step, we calculate

$\forall \alpha, \beta \in \mathbb{R}, \quad \phi_{\alpha,\beta}(t) = E\left[e^{it(\alpha\xi + \beta(\xi^2 - 1))}\right] \quad (94)$
and in the second one we deduce that for all It I < 



6 max J 1 13 j \ 
3 

cr 2 



I n ^.^M^M - e-V^I < 4ma5^^^_,./e^ ^^^^ 
which implies the desired result from the Esseen inequality (see for example [23]
p358)



sup 



p{\Y. + ml - 1) > - 



< 



24 



av 27r 



< 4 max, 1/3,1 t^^-ii^^ ^ max, 1/3, 1 72^2 



max,- 1/3. 



^ ' 72a/- + 32 I < 104 



crVTT 
max,- 1/3," I 



where $ is the cumulative distribution function of a standardised gaussian real 
random variable. 

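Step 1. Let $\Omega_\beta = \{z \in \mathbb{C} : 2\Im(z)\beta > -1\}$ and $\psi_{\alpha,\beta}(z)$ be given by

$\psi_{\alpha,\beta}(z) = \frac{1}{(1 - 2\beta i z)^{1/2}}\exp\left(-\beta i z - \frac{\alpha^2 z^2}{2(1 - 2\beta i z)}\right).$

The function $\psi_{\alpha,\beta}$ is analytic on $\Omega_\beta$. The function $\phi_{\alpha,\beta}(t)$ defined by (94) can
be continued into an analytic function on the domain $\Omega_\beta$ and because

$-y\alpha x - y\beta x^2 - \frac{x^2}{2} = -\frac{1 + 2\beta y}{2}\left(x + \frac{\alpha y}{1 + 2\beta y}\right)^2 + \frac{\alpha^2 y^2}{2(1 + 2\beta y)},$

we observe that, for all real $y$ with $2\beta y > -1$,

$\psi_{\alpha,\beta}(iy) = \phi_{\alpha,\beta}(iy).$

Also, we can deduce that $\phi_{\alpha,\beta}(z)$ and $\psi_{\alpha,\beta}(z)$ are equal on $\Omega_\beta$ and in particular
on $\mathbb{R}$, which gives

$\forall \alpha, \beta \in \mathbb{R},\ t \in \mathbb{R}, \quad \phi_{\alpha,\beta}(t) = \frac{1}{(1 - 2\beta i t)^{1/2}}\exp\left(-\beta i t - \frac{\alpha^2 t^2}{2(1 - 2\beta i t)}\right).$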
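The closed form for the characteristic function of $\alpha\xi + \beta(\xi^2 - 1)$ can be compared with a direct numerical integration against the gaussian density. The sketch below assumes the closed form $(1 - 2i\beta t)^{-1/2}\exp\left(-i\beta t - \alpha^2t^2/(2(1 - 2i\beta t))\right)$ with the principal branch of the square root; the function names, the truncation interval and the number of quadrature nodes are arbitrary choices.

```python
import cmath
import math


def phi_closed_form(alpha, beta, t):
    """Closed form of E[exp(i t (alpha xi + beta (xi^2 - 1)))], xi ~ N(0, 1)."""
    z = 1.0 - 2j * beta * t  # real part equals 1, so the principal root is fine
    return cmath.exp(-1j * beta * t - alpha ** 2 * t ** 2 / (2.0 * z)) / cmath.sqrt(z)


def phi_numeric(alpha, beta, t, half_width=12.0, n=48_000):
    """Trapezoidal approximation of the same expectation on [-half_width, half_width]."""
    h = 2.0 * half_width / n
    total = 0j
    for k in range(n + 1):
        x = -half_width + k * h
        weight = 0.5 if k in (0, n) else 1.0
        exponent = 1j * t * (alpha * x + beta * (x * x - 1.0)) - 0.5 * x * x
        total += weight * cmath.exp(exponent)
    return total * h / math.sqrt(2.0 * math.pi)
```

For moderate values such as `alpha = 0.7`, `beta = -0.4`, `t = 1.3`, the two functions agree to many digits, which is a quick sanity check of the analytic continuation argument.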
Step 2. Proof of (95). The preceding equation gives

I n ^o.At/<^) - = e"^ - 1| < e""^ |^|e^ 

where 

u=i et -y + Ej-l/2(Y^^ + ^(-2/3.--log(l-2/3,u^^^ 
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and hence 

j>0 I ^ 



u 1 



2 (1 - 2(3jiu) 



u^2p^ 1 ' 
^^--(2/?,m + log(l-2/?,m)) 



In addition, if \t\ < 



6 max^ \f3i \ 



(96) 

, then for all j £ N \2u(3j\ < i and we have (cf 



Taylor expansion (1) p352 in [23] ) 



I log(l - 2/3jw) + 2f3jui - 



1 



\-\2ul5,\ 



< 4|'u/3j-|^max|/3j| 
j 



We also have 



u'^a'j 1 a^u^ 



1 



As a consequence, if |i| < gn^^x- | ' then (96) implies: 

1^1 < 2a2|ii|3max|/3,| = ^J^^^^±lM\t\^ ^ 



and 



□ 
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